Oh, how I do hate character encoding issues. Before you get too roped into this, if you can answer this question in the negative, do so quickly:
Can UTF-8 represent all possible Unicode encodings? I'm gonna assume yes for the rest of this post.

> What Curt said...

I think we're in a bit of a weird spot here 'cause we're playing with the head butt of two different RFCs: the HTTP transport RFC that deals with Content-Encoding and charset awesomeness, and the JSON RFC that is so full of ambiguity I'd like to kick it. On the plus side, there's so much ambiguity here that we can basically do whatever we want and no one can accuse us of being wrong.

That said, I think we should isolate concerns. Unless someone wants to write a JSON parser that understands multiple character encodings and doesn't suck ass performance wise, we should probably just assume the JSON parser is UTF-8 only.

Before anyone goes hollering about that, we still have the HTTP layer to play with in terms of accepting content encoding. And nothing in the HTTP layer says we have to accept UC-4 or NR-17 or whatever. So while we're more than welcome to reject any request bodies way before they hit the JSON parser, Noah would probably cut my throat for suggesting we don't play nice.

Either way, this big conversation on character encodings should probably focus on how we move things to UTF-8, which I officially nominate as the already de facto CouchDB character encoding.

> There are a couple of questions that could be addressed:
>
> 1. How to treat a JSON request entity that does not contain a
> Content-Encoding header. Particularly when the entity is not consistent
> with the expected encoding.

Assume UTF-8. If that fails, maybe try guessing. If that fails too, throw a meatball at the client saying rejected. We already ignore quite a few headers and do things "Non-RESTful-ly" so I'm not too concerned.

> 2. How to treat a JSON request with a specified Content-Encoding.

If the encoding is understood, transcode to a UTF-8 representation.

> What encodings would be supported?

Patches welcome. UTF-8 is currently kinda sorta supported.

> What would CouchDB do for an unsupported encoding?

Tell the client that we don't support their weirdo character encoding and that patches are welcome at the CouchDB JIRA page that no one likes visiting 'cause Java is the devil. Maybe we don't mention that last bit though?

> What would occur if the entity was not consistent with the encoding?

If a client goes out of their way to specify a Content-Encoding and they send shit that doesn't comply, then we should throw a huge pie at them and drop the connection. I'm thinking of a Nelson "Ha, ha!" and pointing of many fingers.

> 3. What should CouchDB send when there is no "Accept-Charset" in the
> request.

UTF-8. 'Cause it's yummy.

> 4. What should CouchDB send where there is an "Accept-Charset" in the
> request. Particularly if the request does not contain a UTF.
>

If we understand it, transcode UTF-8 to the requested charset. Otherwise, say "Can't do it!".

HTH,
Paul Davis
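
P.S. For anyone who wants the policy above in something more concrete than meatballs and pies, here is a rough sketch in Python. It is not CouchDB code (CouchDB is Erlang), and every name in it (to_utf8, from_utf8, UnsupportedCharset, MalformedBody) is made up for illustration. It assumes the HTTP layer has already boiled the headers down to a single charset name or None, it skips the "maybe try guessing" step entirely, and "understood" just means whatever codecs Python happens to ship with.

```python
from typing import Optional


class UnsupportedCharset(Exception):
    """We don't understand (or can't produce) the named charset."""


class MalformedBody(Exception):
    """The bytes don't match the charset they claim, or are assumed, to use."""


def to_utf8(body: bytes, charset: Optional[str] = None) -> bytes:
    """Request side (questions 1 and 2): hand the JSON parser UTF-8 only."""
    # No charset given: assume UTF-8. The guessing fallback is not sketched.
    name = charset or "utf-8"
    try:
        text = body.decode(name, "strict")
    except LookupError as exc:
        # Unknown charset name: "patches welcome".
        raise UnsupportedCharset(name) from exc
    except UnicodeDecodeError as exc:
        # Body doesn't comply with its charset: reject the request.
        raise MalformedBody(name) from exc
    return text.encode("utf-8")


def from_utf8(doc: bytes, accept_charset: Optional[str] = None) -> bytes:
    """Response side (questions 3 and 4): stored docs are assumed UTF-8."""
    # No Accept-Charset: send UTF-8.
    name = accept_charset or "utf-8"
    text = doc.decode("utf-8")
    try:
        return text.encode(name, "strict")
    except LookupError as exc:
        raise UnsupportedCharset(name) from exc
    except UnicodeEncodeError as exc:
        # Charset is known but can't hold this document. Treating that as
        # "Can't do it!" as well is an assumption; the thread doesn't cover it.
        raise UnsupportedCharset(name) from exc
```

With this shape the JSON parser only ever sees UTF-8, and the two exceptions are all the HTTP layer needs in order to pick an error response. Real Accept-Charset headers can list several charsets with q-values; picking one out of that list is left out here.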
