Re: Character encodings and JSON RFC (spun off from COUCHDB-345)

Paul Davis Sat, 29 Aug 2009 21:34:23 -0700

Oh, how I hate character encoding issues.

Before you get too roped into this, if you can answer this question in
the negative, do so quickly:


Can UTF-8 represent all possible Unicode characters?

I'm gonna assume yes for the rest of this post.
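
For what it's worth, the answer is yes: UTF-8 can encode every Unicode
code point except the surrogate range U+D800..U+DFFF, which is reserved
for UTF-16 and does not stand for characters. A quick brute-force check
in Python (the function name is made up for illustration):

```python
def utf8_covers_all_scalar_values() -> bool:
    """Round-trip every Unicode scalar value through UTF-8."""
    for cp in range(0x110000):          # U+0000 .. U+10FFFF
        if 0xD800 <= cp <= 0xDFFF:      # surrogates are not characters
            continue
        ch = chr(cp)
        if ch.encode("utf-8").decode("utf-8") != ch:
            return False
    return True


if __name__ == "__main__":
    print(utf8_covers_all_scalar_values())
```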

> What Curt said...

I think we're in a bit of a weird spot here cause we're playing with
the head butt of two different RFCs. The HTTP transport RFC that
deals with Content-Encoding and charset awesomeness and the JSON RFC
that is so full of ambiguity I'd like to kick it. On the plus side,
there's so much ambiguity here that we can basically do whatever we
want and no one can accuse us of being wrong.

That said, I think we should isolate concerns. Unless someone wants
to write a JSON parser that understands multiple character encodings
and doesn't suck ass performance-wise, we should probably just assume
the JSON parser is UTF-8 only.

Before anyone goes hollering about that, we still have the HTTP layer
to play with in terms of accepting content encoding. And nothing in
the HTTP layer says we have to accept UC-4 or NR-17 or whatever. So
while we're more than welcome to reject any request bodies way before
they hit the JSON serializer, Noah would probably cut my throat for
suggesting we don't play nice. Either way, this big conversation on
character encodings should probably focus on how we move things to
UTF-8, which I officially nominate as the already de facto CouchDB
character encoding.

> There are a couple of questions that could be addressed:
>
> 1. How to treat a JSON request entity that does not contain a
> Content-Encoding header.  Particularly when the entity is not consistent
> with the expected encoding.

Assume UTF-8. If that fails, maybe try guessing. If that fails too,
throw a meatball at the client saying rejected. We already ignore quite
a few headers and do things "Non-RESTful-ly" so I'm not too concerned.
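
A rough sketch of that fallback chain in Python, not CouchDB code. The
"guessing" step here is one possible choice: the null-byte pattern from
section 3 of the JSON RFC (RFC 4627), which works because the first two
characters of a JSON text are always ASCII. "Fail" is taken to mean
either the bytes do not decode or the result does not parse as JSON.
The names RejectedBody, guess_encoding and parse_json_body are
hypothetical.

```python
import json
from typing import Any, Optional


class RejectedBody(ValueError):
    """Hypothetical error: the meatball thrown at the client."""


def guess_encoding(raw: bytes) -> Optional[str]:
    """Guess UTF-16/32 from where the null bytes sit in the first 4 octets."""
    head = raw[:4]
    if len(head) < 4:
        return None
    nulls = tuple(b == 0 for b in head)
    return {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }.get(nulls)


def parse_json_body(raw: bytes) -> Any:
    """Try UTF-8 first, then the guessed encoding, then give up."""
    candidates = ["utf-8"]
    guess = guess_encoding(raw)
    if guess is not None:
        candidates.append(guess)
    for encoding in candidates:
        try:
            return json.loads(raw.decode(encoding))
        except ValueError:
            # UnicodeDecodeError and json.JSONDecodeError are both
            # ValueError subclasses, so either kind of failure lands here.
            continue
    raise RejectedBody("body is neither UTF-8 nor a guessable UTF-16/32")
```

With this, a UTF-16 body with no header still parses, while something
like Latin-1 with non-ASCII bytes gets rejected because there is no
reliable way to guess it.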

> 2. How to treat a JSON request with a specified Content-Encoding.

If the encoding is understood, transcode to a UTF-8 representation.
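
A minimal sketch of the transcoding step in Python, assuming the
declared name has already been pulled out of the request headers (in
HTTP the charset normally travels as the charset parameter of
Content-Type). "Understood" here just means Python's codec registry
knows the name; transcode_to_utf8 is a hypothetical helper.

```python
import codecs


def transcode_to_utf8(raw: bytes, declared: str) -> bytes:
    """Re-encode a request body from its declared charset to UTF-8.

    Raises LookupError if the charset name is unknown and
    UnicodeDecodeError if the bytes do not match the declared charset.
    """
    codecs.lookup(declared)  # LookupError for names we do not understand
    return raw.decode(declared).encode("utf-8")
```

After this point everything downstream, including the JSON parser, only
ever sees UTF-8.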

> What encodings would be supported?

Patches welcome. UTF-8 currently kinda sorta supported.

> What would CouchDB do for an unsupported encoding?

Tell the client that we don't support their weirdo character encoding
and that patches are welcome at the CouchDB JIRA page that no one
likes visiting cause Java is the devil. Maybe we don't mention that
last bit though?

> What would occur if the entity was not consistent with the encoding?

If a client goes out of their way to specify a Content-Encoding and
they send shit that doesn't comply then we should throw a huge pie at
them and drop the connection. I'm thinking of a Nelson "Ha, ha!" and
pointing of many fingers.
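
The two failure cases (charset we don't know, and body that doesn't
match the charset it claims) can be told apart cheaply. A sketch in
Python; the status codes 415 and 400 are an assumption for illustration,
not something this thread settled on, and check_declared_body is a
hypothetical name.

```python
from typing import Tuple


def check_declared_body(raw: bytes, declared: str) -> Tuple[int, str]:
    """Return an (HTTP status, reason) pair for a body and its charset."""
    try:
        raw.decode(declared)
    except LookupError:
        # The charset name is not one we know how to decode.
        return 415, "unsupported charset: " + declared
    except UnicodeDecodeError:
        # The charset is known, but the bytes are not valid in it.
        return 400, "body is not valid " + declared
    return 200, "ok"
```

Note that single-byte charsets such as ISO-8859-1 accept any byte
sequence, so the second check can only catch mismatches for encodings
that have invalid sequences, UTF-8 among them.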

> 3. What should CouchDB send when there is no "Accept-Charset" in the
> request.

UTF-8. Cause it's yummy.

> 4. What should CouchDB send when there is an "Accept-Charset" in the
> request.  Particularly if the request does not contain a UTF.
>

If we understand it, transcode UTF-8 to the requested charset.
Otherwise, say "Can't do it!".
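
Answers 3 and 4 together could look roughly like the Python sketch
below. It is a simplified reading of Accept-Charset (comma-separated
names with optional q values, "*" treated as UTF-8), and NotAcceptable,
parse_accept_charset and encode_response are hypothetical names. A
server would presumably turn NotAcceptable into a 406 response.

```python
import codecs
from typing import List, Optional, Tuple


class NotAcceptable(Exception):
    """Hypothetical error for the "Can't do it!" case."""


def parse_accept_charset(header: str) -> List[str]:
    """Return charset names from the header, most preferred first."""
    prefs = []
    for item in header.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # unparseable weight: treat as "not wanted"
        if q > 0:
            prefs.append((q, name))
    prefs.sort(key=lambda pref: -pref[0])  # stable, keeps header order on ties
    return [name for _, name in prefs]


def encode_response(text: str,
                    accept_charset: Optional[str] = None) -> Tuple[str, bytes]:
    """Pick a charset for the response and encode the body with it."""
    if not accept_charset:
        return "utf-8", text.encode("utf-8")  # no header: send UTF-8
    for name in parse_accept_charset(accept_charset):
        if name == "*":
            name = "utf-8"
        try:
            codecs.lookup(name)
            return name, text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown charset, or it cannot represent this body
    raise NotAcceptable(accept_charset)
```

A charset that is known but cannot represent the document (say, ASCII
for a body with accented characters) is skipped in favour of the next
acceptable one.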

HTH,
Paul Davis
