On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote:
>
> On 06/10/2013 10:18 AM, Tom Lane wrote:
>> Andrew Dunstan <and...@dunslane.net> writes:
>>> After thinking about this some more I have come to the conclusion that
>>> we should only do any de-escaping of \uxxxx sequences, whether or not
>>> they are for BMP characters, when the server encoding is utf8. For any
>>> other encoding, which is already a violation of the JSON standard
>>> anyway, and should be avoided if you're dealing with JSON, we should
>>> just pass them through even in text output. This will be a simple and
>>> very localized fix.
>>
>> Hmm. I'm not sure that users will like this definition --- it will seem
>> pretty arbitrary to them that conversion of \u sequences happens in some
>> databases and not others.
Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database
where everyone uses client_encoding = LATIN1 should not change the
semantics of successful SQL statements. Some statements that fail with
one database encoding will succeed in the other, but a user should not
witness a changed non-error result. (Except functions like decode()
that explicitly expose byte representations.) Having "SELECT
'["\u00e4"]'::json ->> 0" emit 'ä' in the UTF8 database and '\u00e4' in
the LATIN1 database would move PostgreSQL in the wrong direction
relative to that ideal.

> Then what should we do when there is no matching codepoint in the
> database encoding? First we'll have to delay the evaluation so it's not
> done over-eagerly, and then we'll have to try the conversion and throw
> an error if it doesn't work. The second part is what's happening now,
> but the delayed evaluation is not.

+1 for doing it that way.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
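
For concreteness, a sketch of the two behaviors under discussion. The
queries are real, but the error message wording is illustrative rather
than PostgreSQL's actual output, and the LATIN1 case assumes the
delayed-evaluation behavior Andrew describes:

    -- UTF8 database: the \u00e4 escape is de-escaped to the character
    -- it names (LATIN SMALL LETTER A WITH DIAERESIS)
    SELECT '["\u00e4"]'::json ->> 0;
     ?column?
    ----------
     ä
    (1 row)

    -- LATIN1 database: U+00E4 exists in LATIN1, so the query above
    -- would still succeed. A codepoint with no LATIN1 equivalent,
    -- such as U+0928 (DEVANAGARI LETTER NA), would fail only when the
    -- value is actually converted to text, not at json input time.
    SELECT '["\u0928"]'::json ->> 0;
    ERROR:  Unicode code point U+0928 has no equivalent in encoding "LATIN1"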