I'm spinning off a discussion that was occurring in COUCHDB-345
(http://issues.apache.org/jira/browse/COUCHDB-345) to the mailing
list, since it was growing beyond the immediate issue reported. The
reported problem was that CouchDB would accept PUT requests without
checking that the content contained valid UTF-8 encoded data, which
could result in documents that could not be retrieved, disrupt view
generation, and potentially have other adverse side effects.
Noah Slater added a comment - 29/Aug/09 09:40 AM
I disagree with Curt.
The JSON RFC is either wrong or carelessly worded. You cannot encode
anything as Unicode, because Unicode is not an encoding; it is a
collection of code points that have no binary representation. You can
encode these code points into character data, and you can decode the
same character data into Unicode. Unicode is always some internal
representation after decoding and before encoding. I am guessing
everyone already knows this, but I keep seeing people form arguments
(particularly on IRC) that start with "since JSON has to be encoded
as Unicode", which is a meaningless sentence (and the RFC is to
blame, as it uses this wording), and hence the conclusions that
follow from it have tended to be false.
I agree that it is unfortunately worded. I checked the IETF RFC
errata page (http://www.rfc-editor.org/errata_search.php?rfc=4627)
and did not find a clarification on this issue. I basically
interpreted "in Unicode" as "in a Unicode Transformation Format", and
more specifically as in a UTF recommended by the Unicode consortium
and in widespread use.
There is this quote from http://www.json.org/fatfree.html:
The character encoding of JSON text is always Unicode. UTF-8 is the
only encoding that makes sense on the wire, but UTF-16 and UTF-32
are also permitted.
It still uses the troublesome meme "character encoding ... Unicode";
however, it seems a stretch to read that and think that Shift-JIS,
ISO-8859-8, MacLatin, EBCDIC, etc. are also fine and dandy.
Don Box also seems to have a similar interpretation (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx
):
2. Like Joe Gregorio states in the comments on Tim's post, I also
prefer JSON's simplification of only allowing UTF-*-based encoding.
Tim Bray, in http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON,
had an interesting comment:
I look at the Ruby JSON library, for example, and I see all these
character encoding conversion routines; blecch.
Use JSON · Seems easy to me; if you want to serialize a data
structure that’s not too text-heavy and all you want is for the
receiver to get the same data structure with minimal effort, and you
trust the other end to get the i18n right, JSON is hunky-dory.
This hints that things in the field aren't all pristine UTF
encodings, however.
Probably best to ping the RFC editor to see if there is a clarification.
In this case, what the JSON RFC should say is that JSON should be
encoded from Unicode, which means that the encoding could be
anything from ISO-8859-1 to Shift JIS, which means that we cannot
"unambiguously determine the encoding from the content." Even if we
decided to only allow UTF-8, UTF-16, or UTF-32, we could only
"unambiguously determine the encoding" if the request body included
the BOM, which is entirely optional. So again, without the Content-
Encoding information, we are forced to use a heuristic. Heuristics
already exist, and where they are not already available in Erlang, I
rather suspect that they can be ported with relative ease.
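For illustration, a BOM probe of that sort is only a few lines of
Erlang binary pattern matching. This is a sketch under my own naming,
not anything in the CouchDB tree; note that UTF-32LE has to be tested
before UTF-16LE, because the UTF-32LE BOM begins with the UTF-16LE
one:

    %% Sketch only: map a leading byte-order mark to an encoding.
    %% Returns {Encoding, BomLength}, or none if no BOM is present.
    detect_bom(<<16#EF,16#BB,16#BF, _/binary>>)       -> {utf8, 3};
    detect_bom(<<16#FF,16#FE,16#00,16#00, _/binary>>) -> {utf32le, 4};
    detect_bom(<<16#00,16#00,16#FE,16#FF, _/binary>>) -> {utf32be, 4};
    detect_bom(<<16#FF,16#FE, _/binary>>)             -> {utf16le, 2};
    detect_bom(<<16#FE,16#FF, _/binary>>)             -> {utf16be, 2};
    detect_bom(_)                                     -> none.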
I haven't worked through all the sequences, but since we know that
the first character of a JSON text is either "[" or "{", that should
be enough to unambiguously determine the only possible encoding from
the set UTF-8, UTF-16BE, UTF-16LE, or UTF-32.
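This is essentially the heuristic RFC 4627 section 3 describes:
because the first two characters of a JSON text are ASCII, the
pattern of NUL bytes in the first four octets identifies the UTF. A
sketch in Erlang (the function name is mine; clause order matters,
since the UTF-32 patterns also satisfy the UTF-16 ones):

    %% Sketch of the RFC 4627 section 3 heuristic: the first two
    %% characters of a JSON text are known to be ASCII, so the NUL
    %% pattern in the first four octets pins down the encoding.
    detect_encoding(<<0,0,0,_, _/binary>>) -> utf32be;
    detect_encoding(<<0,_,0,_, _/binary>>) -> utf16be;
    detect_encoding(<<_,0,0,0, _/binary>>) -> utf32le;
    detect_encoding(<<_,0,_,0, _/binary>>) -> utf16le;
    detect_encoding(_)                     -> utf8.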
If we can access the Content-Encoding, we should absolutely use it,
and absolutely reject as garbage any request that cannot be decoded
with the explicit encoding. Any patch that willfully ignored this
information only to fall back on a heuristic would get my emphatic
veto. I am, however, satisfied with requiring UTF-8 in the short term
and adding Content-Encoding awareness at some later point.
The RFC says "A JSON parser MAY accept non-JSON forms or
extensions." So unlike an XML processor, which is prohibited from
assuming ISO-8859-1 if it encounters an invalid UTF-8 sequence, a
JSON parser could transparently assume ISO-8859-1 after encountering
a bad UTF-8 sequence. Whether that would be a good thing is
debatable.
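As a hedged sketch of what such a lenient front end might look like
(this is not current CouchDB behaviour): try the body as UTF-8 and,
on failure, re-decode it as ISO-8859-1, which cannot fail because
every byte sequence is valid Latin-1:

    %% Sketch only: accept valid UTF-8 as-is, otherwise silently
    %% reinterpret the bytes as ISO-8859-1 and convert to UTF-8.
    to_utf8(Bin) ->
        case unicode:characters_to_binary(Bin, utf8, utf8) of
            Utf8 when is_binary(Utf8) ->
                Utf8;
            _ ->
                %% Every byte is a valid Latin-1 character, so
                %% this conversion always succeeds.
                unicode:characters_to_binary(Bin, latin1, utf8)
        end.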
There are a couple of questions that could be addressed:
1. How to treat a JSON request entity that does not contain a Content-
Encoding header. Particularly when the entity is not consistent with
the expected encoding.
2. How to treat a JSON request with a specified Content-Encoding.
What encodings would be supported? What would CouchDB do for an
unsupported encoding? What would occur if the entity was not
consistent with the encoding?
3. What should CouchDB send when there is no "Accept-Charset" in the
request?
4. What should CouchDB send when there is an "Accept-Charset" in the
request? Particularly if it does not include a UTF.
I think the current answers are:
1. The entity is interpreted as UTF-8. Currently, if the encoding is
inconsistent, it is still committed to the database and bad things
happen later. If a fix for COUCHDB-345 is committed, then CouchDB
would reject the request with a 400 (a sketch of such a check follows
this list).
2. Same as 1, Content-Encoding is not considered.
3. CouchDB always sends UTF-8.
4. Same as 3, Accept-Charset is not considered.
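For concreteness, here is a minimal sketch of the check behind
answer 1 (my own sketch, not the actual COUCHDB-345 patch),
validating the body as UTF-8 before it is committed; the HTTP layer
would map false to a 400:

    %% Sketch only, not the committed fix: a structural UTF-8 check
    %% using Erlang's utf8 bit-syntax segment, which also rejects
    %% overlong forms and surrogates.
    valid_utf8(<<>>) ->
        true;
    valid_utf8(<<_/utf8, Rest/binary>>) ->
        valid_utf8(Rest);
    valid_utf8(_) ->
        false.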
It is not a pressing issue for me, and since COUCHDB-345 languished
for such a long time, I don't think many people are trying to push
other encodings into the database, with the exception of people
pushing ISO-8859-1 up and not getting burned because their content
hasn't yet contained non-ASCII characters.