I'm spinning off a discussion that was occurring in COUCHDB-345
(http://issues.apache.org/jira/browse/COUCHDB-345) to the mailing
list, since it was growing beyond the immediate issue reported. The
reported problem was that CouchDB would accept PUT requests without
checking that the content contained valid UTF-8 encoded data, which
could result in documents that could not be retrieved, disrupt view
generation, and potentially have other adverse side effects.
Noah Slater added a comment - 29/Aug/09 09:40 AM
I disagree with Curt.
The JSON RFC is either wrong or carelessly worded. You cannot encode
anything as Unicode, because Unicode is not an encoding; it is a
collection of code points that have no binary representation. You can
encode these code points into character data, and you can decode the
same character data into Unicode. Unicode is always some internal
representation after decoding and before encoding. I am guessing
everyone already knows this, but I keep seeing people form arguments
(particularly on IRC) that start with "since JSON has to be encoded
as Unicode", which is a meaningless sentence (and the RFC is to
blame, as it uses this wording), and hence the conclusions that
follow from it have tended to be false.
I agree that it is unfortunately worded. I checked the IETF RFC
errata page (http://www.rfc-editor.org/errata_search.php?rfc=4627)
and did not find a clarification on this issue. I basically
interpreted "in Unicode" as "in a Unicode Transformation Format", and
more specifically as in a UTF recommended by the Unicode consortium
and in widespread use.
There is this quote from http://www.json.org/fatfree.html:
The character encoding of JSON text is always Unicode. UTF-8 is the
only encoding that makes sense on the wire, but UTF-16 and UTF-32
are also permitted.
It still uses the troublesome meme "character encoding ... Unicode";
however, it seems a stretch to read that and think that Shift-JIS,
ISO-8859-8, MacLatin, EBCDIC, etc. are also fine and dandy.
Don Box also seems to have a similar interpretation (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx
):
2. Like Joe Gregorio states in the comments on Tim's post, I also
prefer JSON's simplification of only allowing UTF-*-based encoding.
Tim Bray, in http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON,
had an interesting comment:
I look at the Ruby JSON library, for example, and I see all these
character encoding conversion routines; blecch.
Use JSON · Seems easy to me; if you want to serialize a data
structure that’s not too text-heavy and all you want is for the
receiver to get the same data structure with minimal effort, and you
trust the other end to get the i18n right, JSON is hunky-dory.
This hints that things in the field aren't all pristine UTF
encodings, however.
Probably best to ping the RFC editor to see if there is a clarification.
In this case, what the JSON RFC should say is that JSON should be
encoded from Unicode, which means that the encoding could be
anything from ISO-8859-1 to Shift JIS, which means that we cannot
"unambiguously determine the encoding from the content." Even if we
decided to only allow UTF-8, UTF-16, or UTF-32, we could only
"unambiguously determine the encoding" if the request body included
the BOM, which is entirely optional. So again, without the Content-
Encoding information, we are forced to use a heuristic. Heuristics
already exist, and where they are not already available in Erlang, I
rather suspect that they can be ported with relative ease.
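For illustration, a BOM probe of that sort is only a few lines of
Erlang binary pattern matching. This is a sketch under my own naming,
not anything in the CouchDB tree; note that UTF-32LE has to be tested
before UTF-16LE, because the UTF-32LE BOM begins with the UTF-16LE
one:

    %% Sketch only: map a leading byte-order mark to an encoding.
    %% Returns {Encoding, BomLength}, or none if no BOM is present.
    detect_bom(<<16#EF,16#BB,16#BF, _/binary>>)       -> {utf8, 3};
    detect_bom(<<16#FF,16#FE,16#00,16#00, _/binary>>) -> {utf32le, 4};
    detect_bom(<<16#00,16#00,16#FE,16#FF, _/binary>>) -> {utf32be, 4};
    detect_bom(<<16#FF,16#FE, _/binary>>)             -> {utf16le, 2};
    detect_bom(<<16#FE,16#FF, _/binary>>)             -> {utf16be, 2};
    detect_bom(_)                                     -> none.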
I haven't worked through all the sequences, but since we know that
the first character of a JSON text is either "[" or "{", that should
be enough to unambiguously determine the only possible encoding from
the set UTF-8, UTF-16BE, UTF-16LE, or UTF-32.
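This is essentially the heuristic RFC 4627 section 3 describes:
because the first two characters of a JSON text are ASCII, the
pattern of NUL bytes in the first four octets identifies the UTF. A
sketch in Erlang (the function name is mine; clause order matters,
since the UTF-32 patterns also satisfy the UTF-16 ones):

    %% Sketch of the RFC 4627 section 3 heuristic: the first two
    %% characters of a JSON text are known to be ASCII, so the NUL
    %% pattern in the first four octets pins down the encoding.
    detect_encoding(<<0,0,0,_, _/binary>>) -> utf32be;
    detect_encoding(<<0,_,0,_, _/binary>>) -> utf16be;
    detect_encoding(<<_,0,0,0, _/binary>>) -> utf32le;
    detect_encoding(<<_,0,_,0, _/binary>>) -> utf16le;
    detect_encoding(_)                     -> utf8.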
If we can access the Content-Encoding, we should absolutely use it,
and absolutely reject as garbage any request that cannot be decoded
with the explicit encoding. Any patch that willfully ignored this
information only to fall back on a heuristic would get my emphatic
veto. I am, however, satisfied with requiring UTF-8 in the short term
and adding Content-Encoding awareness at some later point.
The RFC says "A JSON parser MAY accept non-JSON forms or
extensions." So unlike an XML processor, which is prohibited from
assuming ISO-8859-1 if it encounters an invalid UTF-8 sequence, a
JSON parser could transparently assume ISO-8859-1 after encountering
a bad UTF-8 sequence. Whether that would be a good thing is
debatable.
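As a hedged sketch of what such a lenient front end might look like
(this is not current CouchDB behaviour): try the body as UTF-8 and,
on failure, re-decode it as ISO-8859-1, which cannot fail because
every byte sequence is valid Latin-1:

    %% Sketch only: accept valid UTF-8 as-is, otherwise silently
    %% reinterpret the bytes as ISO-8859-1 and convert to UTF-8.
    to_utf8(Bin) ->
        case unicode:characters_to_binary(Bin, utf8, utf8) of
            Utf8 when is_binary(Utf8) ->
                Utf8;
            _ ->
                %% Every byte is a valid Latin-1 character, so
                %% this conversion always succeeds.
                unicode:characters_to_binary(Bin, latin1, utf8)
        end.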
There are a couple of questions that could be addressed:
1. How to treat a JSON request entity that does not contain a Content-
Encoding header. Particularly when the entity is not consistent with
the expected encoding.
2. How to treat a JSON request with a specified Content-Encoding.
What encodings would be supported? What would CouchDB do for an
unsupported encoding? What would occur if the entity was not
consistent with the encoding?
3. What should CouchDB send when there is no "Accept-Charset" in the
request?
4. What should CouchDB send when there is an "Accept-Charset" in the
request? Particularly if it does not include a UTF.
I think the current answers are:
1. The entity is interpreted as UTF-8. Currently, if the encoding is
inconsistent, it is still committed to the database and bad things
happen later. If a fix for COUCHDB-345 is committed, then CouchDB
would reject the request with a 400 (a sketch of such a check follows
this list).
2. Same as 1, Content-Encoding is not considered.
3. CouchDB always sends UTF-8.
4. Same as 3, Accept-Charset is not considered.
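For concreteness, here is a minimal sketch of the check behind
answer 1 (my own sketch, not the actual COUCHDB-345 patch),
validating the body as UTF-8 before it is committed; the HTTP layer
would map false to a 400:

    %% Sketch only, not the committed fix: a structural UTF-8 check
    %% using Erlang's utf8 bit-syntax segment, which also rejects
    %% overlong forms and surrogates.
    valid_utf8(<<>>) ->
        true;
    valid_utf8(<<_/utf8, Rest/binary>>) ->
        valid_utf8(Rest);
    valid_utf8(_) ->
        false.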
It is not a pressing issue for me, and since COUCHDB-345 languished
for such a long time, I don't think many people are trying to push
other encodings into the database, with the exception of people
pushing ISO-8859-1 up and not getting burned because their content
hasn't yet contained non-ASCII characters.