I'm spinning a discussion that was occurring in COUCHDB-345 (http://issues.apache.org/jira/browse/COUCHDB-345 ) over to the mailing list since it was growing beyond the immediate issue reported. The reported problem was that CouchDB would accept PUT requests without checking that the content contained valid UTF-8 encoded data which would result in documents that could not be retrieved, would disrupt view generation and potentially have other adverse side-effects.
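For concreteness, the check at issue can be sketched in a few lines (in Python rather than CouchDB's Erlang; the function name is mine, purely illustrative):

```python
def is_valid_utf8(body: bytes) -> bool:
    """Return True if the request body is well-formed UTF-8."""
    try:
        body.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A lone 0xE9 byte is legal Latin-1 but not a valid UTF-8 sequence,
# which is exactly the kind of body COUCHDB-345 complains about.
is_valid_utf8(b'{"name": "caf\xc3\xa9"}')   # valid UTF-8
is_valid_utf8(b'{"name": "caf\xe9"}')       # invalid UTF-8
```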

Noah Slater added a comment - 29/Aug/09 09:40 AM
I disagree with Curt.

The JSON RFC is either wrong or carelessly worded. You cannot encode anything as Unicode because Unicode is not an encoding; it is a collection of code points that have no binary representation. You can encode these code points into character data, and you can decode the same character data into Unicode. Unicode is always some internal representation after decoding, and before encoding. I am guessing everyone already knows this, but I keep seeing people form arguments (particularly on IRC) that start with "since JSON has to be encoded as Unicode", which is a meaningless sentence (and the RFC is to blame, as it uses this wording), and hence the conclusions that follow from it have tended to be false.


I agree that is unfortunately worded. I checked the IETF RFC errata page (http://www.rfc-editor.org/errata_search.php?rfc=4627) and did not find a clarification on this issue. I basically interpreted "in Unicode" as "in a Unicode Transformation Format" and more specifically as in a UTF recommended by the Unicode consortium and in widespread use.


There is this quote from http://www.json.org/fatfree.html:

The character encoding of JSON text is always Unicode. UTF-8 is the only encoding that makes sense on the wire, but UTF-16 and UTF-32 are also permitted.

It still uses the troublesome meme "character encoding ... Unicode"; however, it seems a stretch to read that and think that Shift-JIS, ISO-8859-8, MacLatin, EBCDIC, etc. are also fine and dandy.

Don Box also seems to have a similar interpretation (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx ):

2. Like Joe Gregorio states in the comments on Tim's post, I also prefer JSON's simplification of only allowing UTF-*-based encoding.

Tim Bray in http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON had an interesting comment in:

I look at the Ruby JSON library, for example, and I see all these character encoding conversion routines; blecch. Use JSON · Seems easy to me; if you want to serialize a data structure that’s not too text-heavy and all you want is for the receiver to get the same data structure with minimal effort, and you trust the other end to get the i18n right, JSON is hunky-dory.

This hints that things in the field aren't all pristine UTF encodings, however.

Probably best to ping the RFC editor to see if there is a clarification.


In this case, what the JSON RFC should say is that JSON should be encoded from Unicode, which means that the encoding could be anything from ISO-8859-1 to Shift JIS, which means that we cannot "unambiguously determine the encoding from the content." Even if we decided to only allow UTF-8, UTF-16, or UTF-32, we could only "unambiguously determine the encoding" if the request body included the BOM, which is entirely optional. So again, without the Content-Encoding information, we are forced to use a heuristic. Heuristics already exist, and where they are not already available in Erlang, I rather suspect that they can be ported with relative ease.
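A BOM check, where a BOM happens to be present, is the easy part. A sketch in Python (the function name is mine; note the longer BOMs must be tested first, since the UTF-32LE BOM begins with the UTF-16LE one):

```python
import codecs

def bom_encoding(data: bytes):
    """Return the encoding indicated by an optional BOM, or None."""
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32be"),
        (codecs.BOM_UTF32_LE, "utf-32le"),  # \xff\xfe\x00\x00
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16be"),
        (codecs.BOM_UTF16_LE, "utf-16le"),  # \xff\xfe -- a prefix of the above
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None
```

But since the BOM is optional, this returns None for most real-world JSON bodies, which is exactly why a content heuristic is needed as well.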

I haven't worked through all the sequences, but since we know that the first character of a JSON text is either "[" or "{", that should be enough to unambiguously determine which of UTF-8, UTF-16BE, UTF-16LE, or UTF-32 is the only possible encoding.
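This is in fact the technique RFC 4627 section 3 describes: because the first two characters of a JSON text are ASCII, the pattern of null octets in the first four bytes pins down the UTF. A sketch:

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the UTF of a JSON text from its first four octets,
    following the null-octet patterns in RFC 4627 section 3.
    Assumes the first two characters are ASCII (e.g. '{' and '"')."""
    if len(data) < 4:
        return "utf-8"      # too short to apply the table; fall back
    b = data[:4]
    if b[0] == 0 and b[1] == 0 and b[2] == 0:
        return "utf-32be"   # 00 00 00 xx
    if b[1] == 0 and b[2] == 0 and b[3] == 0:
        return "utf-32le"   # xx 00 00 00
    if b[0] == 0 and b[2] == 0:
        return "utf-16be"   # 00 xx 00 xx
    if b[1] == 0 and b[3] == 0:
        return "utf-16le"   # xx 00 xx 00
    return "utf-8"          # xx xx xx xx
```

Note this only distinguishes the UTFs from each other; it cannot tell UTF-8 apart from ISO-8859-1 or Shift JIS, which is the harder heuristic problem above.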



If we can access the Content-Encoding, we should absolutely use it, and absolutely reject as garbage any request that could not be decoded with the explicit encoding. Any patch that willfully ignored this information only to fall back onto a heuristic would get my emphatic veto. I am however satisfied with requiring UTF-8 in the short term, and adding Content-Encoding awareness at some later point.

The RFC says "A JSON parser MAY accept non-JSON forms or extensions." So unlike an XML processor, which is prohibited from assuming ISO-8859-1 if it encounters an invalid UTF-8 sequence, a JSON parser could transparently assume ISO-8859-1 after encountering a bad UTF-8 sequence. Whether that would be a good thing is debatable.
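What such a lenient parser would do, sketched in Python (whether this is wise is, as said, debatable):

```python
def lenient_decode(body: bytes) -> str:
    """Try strict UTF-8 first; on failure, fall back to ISO-8859-1.
    ISO-8859-1 maps every byte 0x00-0xFF to a code point, so the
    fallback never fails -- but it may silently mangle the data
    (e.g. a body that was really Shift JIS or Windows-1252)."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("iso-8859-1")
```

The danger is that the fallback is unconditional: every byte sequence decodes as *something*, so genuinely corrupt data is accepted rather than rejected.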

There are a couple of questions that could be addressed:

1. How to treat a JSON request entity that does not contain a Content-Encoding header. Particularly when the entity is not consistent with the expected encoding.

2. How to treat a JSON request with a specified Content-Encoding. What encodings would be supported? What would CouchDB do for an unsupported encoding? What would occur if the entity was not consistent with the encoding?

3. What should CouchDB send when there is no "Accept-Charset" in the request?

4. What should CouchDB send when there is an "Accept-Charset" in the request? Particularly if the Accept-Charset does not include a UTF.

I think the current answers are:

1. Entity is interpreted as UTF-8. Currently if the encoding is inconsistent, it is still committed to the database and bad things happen later. If a fix for COUCHDB-345 is committed, then CouchDB would reject the request with a 400.

2. Same as 1, Content-Encoding is not considered.

3. CouchDB always sends UTF-8.

4. Same as 3, Accept-Charset is not considered.
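Answer 1 with the COUCHDB-345 fix applied amounts to something like the following (a hypothetical sketch, not CouchDB's actual Erlang handler; the names and the error body shape are mine):

```python
import json

def handle_put(body: bytes):
    """Treat the body as UTF-8 JSON; reject with a 400 if it is not."""
    try:
        doc = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # Invalid UTF-8 (or invalid JSON) never reaches the database,
        # so it can no longer disrupt view generation later.
        return 400, {"error": "bad_request", "reason": "invalid UTF-8 JSON"}
    return 201, doc

handle_put(b'{"ok": true}')           # accepted, 201
handle_put(b'{"name": "caf\xe9"}')    # Latin-1 byte in body, rejected, 400
```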


It is not a pressing issue for me, and since COUCHDB-345 languished for such a long time, I don't think many people are trying to push other encodings into the database, with the exception of people pushing ISO-8859-1 up and not getting burned because their content hasn't yet contained non-ASCII characters.
