> On Nov 8, 2017, at 06:00, Yuya Nishihara <y...@tcha.org> wrote:
> 
>> On Tue, 7 Nov 2017 09:58:04 -0800, Durham Goode wrote:
>> I wish we had some easily reusable serializer/deserializer instead of 
>> having to reinvent these every time.  What's our reasoning for not using 
>> json? I forget. If there are some weird characters, like control 
>> characters or something, that break json, I'd say we just use json and 
>> prevent users from creating bookmarks and paths with those names.
> 
> Just about json: using json (in Mercurial) is bad because it's so easy to
> spill out unicode objects without noticing. All tests pass (because the
> whole test data is ASCII), and then we get a nice UnicodeError in production.

This issue can be prevented with diligent coding (and possibly a custom wrapper 
to convert unicode to bytes). Python 3 would also surface the unwanted type 
coercion, since mixing bytes and str there fails loudly instead of silently 
coercing.
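
To make that concrete, here's a rough sketch (hypothetical helper, not anything 
in Mercurial today) of what such a wrapper could look like on Python 2, walking 
whatever json.loads() returns and coercing every unicode string back to bytes:

  import json

  def _bytesify(obj):
      # json.loads() hands back unicode for every string; walk the result
      # and encode each one to UTF-8 bytes so callers never see unicode.
      if isinstance(obj, unicode):
          return obj.encode('utf-8')
      if isinstance(obj, list):
          return [_bytesify(v) for v in obj]
      if isinstance(obj, dict):
          return dict((_bytesify(k), _bytesify(v))
                      for k, v in obj.iteritems())
      return obj

  def loads(data):
      # Hypothetical bytes-only replacement for json.loads().
      return _bytesify(json.loads(data))

Anything like this still has to be applied consistently at every json call 
site, which is the "diligent coding" part.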

> 
> Another concern is that encoding conversion can be lossy even if it goes
> with no error. There are n:m mappings between unicode and legacy encoding.
> For example, we have three major Shift_JISes in Japan, and the Microsoft one
> allocates multiple code points for one character for "compatibility" reasons.
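
For anyone who wants to see that concretely, here's a tiny Python 2 
illustration using CP932's duplicate mappings (the NEC and IBM extension areas 
both contain U+2160), so at most one of the two byte sequences can survive a 
decode/encode round trip; which one loses depends on the codec's chosen 
mapping:

  for raw in ('\x87\x54', '\xfa\x4a'):  # both decode to U+2160 in cp932
      roundtripped = raw.decode('cp932').encode('cp932')
      status = 'lossy' if roundtripped != raw else 'ok'
      print raw.encode('hex'), '->', roundtripped.encode('hex'), status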

This is the bigger problem. JSON can't reliably preserve byte sequences unless 
they are valid UTF-8. The most common way to robustly round-trip arbitrary byte 
sequences through JSON is to apply an encoding to string fields that won't 
produce characters needing JSON escapes; base64 is the usual choice.
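
As a sketch of what that looks like in practice (record layout made up purely 
for illustration, Python 2 again), base64 keeps the JSON payload pure ASCII and 
the original bytes come back exactly:

  import base64, json

  def encode_record(record):
      # Base64 every value so the JSON document stays plain ASCII.
      return json.dumps(dict((k, base64.b64encode(v))
                             for k, v in record.iteritems()))

  def decode_record(data):
      return dict((k.encode('ascii'), base64.b64decode(v))
                  for k, v in json.loads(data).iteritems())

  blob = {'author': '\x82\xa0\xff'}  # not valid UTF-8
  assert decode_record(encode_record(blob)) == blob

The obvious costs are the ~33% size overhead and that the values are no longer 
human readable in the JSON.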

Avoiding code points that need to be escaped in JSON seems reasonable for some 
use cases. For things like storing the author field in obs markers, it is not.

I'd just as soon we vendor and use a binary serialization format like Protobuf, 
Thrift, Cap'n Proto, MessagePack, Avro, etc. Bonus points if Rust's serde crate 
can parse it with zero-copy deserialization.
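
For comparison, here's what that buys us with, say, MessagePack (sketch only; 
this assumes the third-party msgpack package, not something we've vendored): 
bytes have a native type, so nothing needs to be escaped or re-encoded:

  import msgpack  # third-party, purely for illustration

  record = {b'author': b'\x82\xa0\xff', b'bookmark': b'@'}
  packed = msgpack.packb(record, use_bin_type=True)
  assert msgpack.unpackb(packed) == record

Whether serde can deserialize a given format zero-copy depends on the concrete 
format and crate, so I'm hand-waving on that part.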