On Sat, Aug 29, 2009 at 10:51:51PM -0500, Curt Arnold wrote:
> I agree that is unfortunately worded. I checked the IETF RFC errata
> page (http://www.rfc-editor.org/errata_search.php?rfc=4627) and did not
> find a clarification on this issue. I basically interpreted "in
> Unicode" as "in a Unicode Transformation Format" and more specifically
> as in a UTF recommended by the Unicode consortium and in widespread use.
"A string is a sequence of zero or more Unicode characters"
- http://www.ietf.org/rfc/rfc4627.txt
Only possible meaning is any encoding of any Unicode code point.
"All Unicode characters may be placed within the quotation marks[...]"
- ibid
Ditto.
"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
- ibid
"encoded in Unicode" is nonsensical, so the only possible parsing is:
"JSON text SHALL be encoded from Unicode. The default encoding is UTF-8."
- ibid
Which means that any encoding is good.
Those are the only mentions of Unicode in the entire specification.
If we search for "encoding" we get:
"[...] the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair"
- ibid
"JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written
in UTF-8, JSON is 8bit compatible. When JSON is written in UTF-16 or UTF-32,
the binary content-transfer-encoding must be used."
- ibid
From this we can conclude:
* JSON can be encoded in UTF-8, UTF-16, or UTF-32.
* The editor of RFC 4627 was high.
> There is this quote from http://www.json.org/fatfree.html:
>>
>> The character encoding of JSON text is always Unicode. UTF-8 is the
>> only encoding that makes sense on the wire, but UTF-16 and UTF-32 are
>> also permitted.
>
> It still uses the troublesome meme "character encoding ... Unicode",
> however it seems to be a stretch to read that and think that Shift-JIS,
> ISO-8859-8, MacLatin, EBCDIC, etc. are also fine and dandy.
The RFC demonstrates conclusively that the only allowable encodings are:
UTF-8, UTF-16, or UTF-32
> Don Box also seems to have a similar interpretation
> (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx
> ):
>>
>> 2. Like Joe Gregorio states in the comments on Tim's post, I also
>> prefer JSON's simplification of only allowing UTF-*-based encoding.
>
> Tim Bray in http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON had
> an interesting comment in:
>>
>> I look at the Ruby JSON library, for example, and I see all these
>> character encoding conversion routines; blecch.
>> Use JSON · Seems easy to me; if you want to serialize a data structure
>> that’s not too text-heavy and all you want is for the receiver to get
>> the same data structure with minimal effort, and you trust the other
>> end to get the i18n right, JSON is hunky-dory.
>
> This hints that things in the field aren't all pristine UTF
> encodings, however.
>
> Probably best to ping the RFC editor to see if there is a clarification.
UTF-8, UTF-16, or UTF-32.
>> In this case, what the JSON RFC should say is that JSON should be
>> encoded from Unicode, which means that the encoding could be anything
>> from ISO-8859-1 to Shift JIS, which means that we cannot
>> "unambiguously determine the encoding from the content." Even if we
>> decided to only allow UTF-8, UTF-16, or UTF-32, we could only
>> "unambiguously determine the encoding" if the request body included
>> the BOM, which is entirely optional. So again, without the Content-
>> Encoding information, we are forced to use a heuristic. Heuristics
>> already exist, and where they are not already available in Erlang, I
>> rather suspect that they can be ported with relative ease.
>
> I haven't worked through all the sequences, but since we know that the
> first characters of a JSON text are either "[" or "{", that should be
> enough to unambiguously determine which encoding of the set UTF-8,
> UTF-16BE, UTF-16LE, or UTF-32 is in use.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
- ibid
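That null-byte table translates directly to code. A minimal sketch in Python (illustration only, not Erlang and not actual CouchDB code) of the detection step the RFC describes:

```python
def detect_json_encoding(data):
    """Guess the UTF of a JSON text from the pattern of null bytes
    in its first four octets, per the table in RFC 4627 section 3."""
    if len(data) < 4:
        raise ValueError("need at least four octets to disambiguate")
    pattern = tuple(byte == 0 for byte in data[:4])
    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
        (False, False, False, False): "utf-8",    # xx xx xx xx
    }
    try:
        return table[pattern]
    except KeyError:
        raise ValueError("not a UTF-encoded JSON text")
```

This works precisely because the first two characters are guaranteed ASCII. A leading BOM would need extra handling, which the sketch omits.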
>> If we can access the Content-Encoding, we should absolutely use it,
>> and absolutely reject as garbage any request that could not be decoded
> with the explicit encoding. Any patch that willfully ignored this
>> information only to fall back onto a heuristic would get my emphatic
>> veto. I am however satisfied with requiring UTF-8 in the short term,
>> and adding Content-Encoding awareness at some later point.
>
> The RFC says "A JSON parser MAY accept non-JSON forms or extensions."
> So unlike an XML processor that is prohibited from assuming ISO-8859-1
> if it encounters an invalid UTF-8 sequence, a JSON parser could
> transparently assume ISO-8859-1 after encountering a bad UTF-8 sequence.
> Whether that would be a good thing is debatable.
ISO-8859-1 JSON is invalid JSON.
> There are a couple of questions that could be addressed:
>
> 1. How to treat a JSON request entity that does not contain a Content-
> Encoding header. Particularly when the entity is not consistent with
> the expected encoding.
Considering we can reliably determine the encoding, we can:
* Ignore the Content-Encoding header completely.
* Reject any request where the Content-Encoding header is wrong.
The jerk in me wants to opt for the second option, but there is always Postel's Law.
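The strict option could look like this Python sketch (a hypothetical helper, not actual CouchDB code): detect the real encoding from the RFC's null-byte pattern and reject the request if the declared charset disagrees.

```python
def charset_matches(body, declared):
    """Strict ("jerk") option: compare the charset the client declared
    against what the RFC 4627 null-byte pattern says about the first
    four octets of the body. Returns False if they disagree or if the
    body is not a UTF-encoded JSON text at all."""
    pattern = tuple(byte == 0 for byte in body[:4])
    detected = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
        (False, False, False, False): "utf-8",    # xx xx xx xx
    }.get(pattern)
    return detected is not None and declared.lower() == detected
```

A server taking this route would answer 400 whenever `charset_matches` returns False; the lenient route simply ignores the header and trusts the detection.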
> 2. How to treat a JSON request with a specified Content-Encoding. What
> encodings would be supported? What would CouchDB do for an unsupported
> encoding? What would occur if the entity was not consistent with the
> encoding?
Ignore it or barf. See above dilemma of jerk vs. hippie approach.
UTF-8, UTF-16, or UTF-32.
Barf.
Ignore it or barf. As above.
> 3. What should CouchDB send when there is no "Accept-Charset" in the
> request.
UTF-8.
> 4. What should CouchDB send where there is an "Accept-Charset" in the
> request. Particularly if the request does not contain a UTF.
406 Not Acceptable.
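Sketched out, the answers to 3 and 4 are a few lines of Python (hypothetical helper, not actual CouchDB code; note it ignores q-values, so a real implementation would also have to honour things like "utf-8;q=0"):

```python
def negotiate_charset(accept_charset):
    """CouchDB only ever emits UTF-8. No Accept-Charset header means
    UTF-8 (question 3); a header that excludes UTF-8 means 406
    (question 4). Simplified: q-values are parsed off and discarded."""
    if accept_charset is None:
        return "utf-8"
    offered = [token.split(";")[0].strip().lower()
               for token in accept_charset.split(",")]
    if "utf-8" in offered or "*" in offered:
        return "utf-8"
    raise ValueError("406 Not Acceptable")
```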
> I think the current answers are:
>
> 1. Entity is interpreted as UTF-8. Currently if the encoding is
> inconsistent, it is still committed to the database and bad things
> happen later. If a fix for COUCHDB-345 is committed, then CouchDB would
> reject the request with a 400.
+1
> 2. Same as 1, Content-Encoding is not considered.
Undecided.
> 3. CouchDB always sends UTF-8.
+1
> 4. Same as 3, Accept-Charset is not considered.
-1
Come on, you gotta give me this. It's fun to send back 406! Stupid clients.
> It is not a pressing issue for me and since COUCHDB-345 languished for
> such a long time, I'm not thinking that many people are trying to push
> other encodings into the database with the exceptions of people pushing
> ISO-8859-1 up but not getting burned since their content hasn't yet
> contained non ASCII characters.
I apologise for not reading the (totally bonkers) RFC properly until now.
Did I mention it's totally bonkers?
Best,
--
Noah Slater, http://tumbolia.org/nslater