On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> BTW, could the \uNNNN problem be finessed by leaving such escapes in
> source form?

Yes, it could.  However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
to be compared.

Here's a Unicode handling policy I came up with.  It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed (the
escape handling is sketched after this list):

 * Whitespace characters surrounding tokens are removed.
 * Long escapes (like \u0022) are converted to short escapes (like \")
where possible.
 * Unnecessary escapes of ASCII characters (e.g. \u0061 and even
\u007F) are converted to their respective characters.
 * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
are converted to their respective characters, but only if the database
encoding is UTF-8.
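
To make the escape rules concrete, here is a minimal standalone sketch
(not code from the patch) of re-emitting one code point parsed from a
\uNNNN escape in condensed form.  It assumes the parser has already
validated the input and combined any surrogate pair into a single code
point, and that db_is_utf8 says whether the database encoding is UTF-8
(GetDatabaseEncoding() == PG_UTF8 in the backend):

    #include <stdbool.h>
    #include <stdio.h>

    /* Re-emit one code point from a \uNNNN escape in condensed form. */
    static void
    emit_condensed(unsigned int cp, bool db_is_utf8, FILE *out)
    {
        switch (cp)
        {
            /* long escape -> short escape where possible */
            case '"':  fputs("\\\"", out); return;
            case '\\': fputs("\\\\", out); return;
            case '\b': fputs("\\b", out);  return;
            case '\f': fputs("\\f", out);  return;
            case '\n': fputs("\\n", out);  return;
            case '\r': fputs("\\r", out);  return;
            case '\t': fputs("\\t", out);  return;
        }

        if (cp < 0x20)
        {
            /* other control characters must stay escaped in JSON */
            fprintf(out, "\\u%04X", cp);
        }
        else if (cp < 0x80)
        {
            /* unnecessary escape of an ASCII character */
            fputc((int) cp, out);
        }
        else if (db_is_utf8)
        {
            /* non-ASCII: emit UTF-8 bytes (database encoding is UTF-8) */
            if (cp < 0x800)
            {
                fputc(0xC0 | (cp >> 6), out);
                fputc(0x80 | (cp & 0x3F), out);
            }
            else if (cp < 0x10000)
            {
                fputc(0xE0 | (cp >> 12), out);
                fputc(0x80 | ((cp >> 6) & 0x3F), out);
                fputc(0x80 | (cp & 0x3F), out);
            }
            else
            {
                fputc(0xF0 | (cp >> 18), out);
                fputc(0x80 | ((cp >> 12) & 0x3F), out);
                fputc(0x80 | ((cp >> 6) & 0x3F), out);
                fputc(0x80 | (cp & 0x3F), out);
            }
        }
        else
        {
            /* non-UTF-8 database: leave the escape in source form */
            if (cp >= 0x10000)
            {
                /* re-split into a UTF-16 surrogate pair */
                unsigned int v = cp - 0x10000;

                fprintf(out, "\\u%04X\\u%04X",
                        0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF));
            }
            else
                fprintf(out, "\\u%04X", cp);
        }
    }

    int main(void)
    {
        /* "\u0022\u0061\u266B" condenses to "\"a♫" in a UTF-8 database */
        emit_condensed(0x0022, true, stdout);
        emit_condensed(0x0061, true, stdout);
        emit_condensed(0x266B, true, stdout);
        fputc('\n', stdout);
        return 0;
    }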

On output (json_out), non-ASCII characters are converted to \uXXXX
escapes, unless one or more of these very likely circumstances hold
(the decision is sketched after this list):

 * The client encoding and database encoding are the same.  No
conversion is performed in that case, so there are no conversion
errors for escaping to prevent.
 * The client encoding is UTF-8.  Escaping is not necessary because
the client can encode all Unicode codepoints.
 * The client encoding and/or database encoding is SQL_ASCII.
SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.
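
The decision itself is cheap.  A sketch with illustrative names (in the
backend the encodings would come from GetDatabaseEncoding() and
pg_get_client_encoding(), not strings):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Should json_out escape non-ASCII characters as \uXXXX? */
    static bool
    should_escape_non_ascii(const char *db_enc, const char *client_enc)
    {
        if (strcmp(db_enc, client_enc) == 0)
            return false;   /* no conversion happens, nothing to prevent */
        if (strcmp(client_enc, "UTF8") == 0)
            return false;   /* client can represent every codepoint */
        if (strcmp(db_enc, "SQL_ASCII") == 0 ||
            strcmp(client_enc, "SQL_ASCII") == 0)
            return false;   /* SQL_ASCII does no transcoding at all */
        return true;        /* otherwise, escape to be safe */
    }

    int main(void)
    {
        printf("%d\n", should_escape_non_ascii("LATIN1", "WIN1252")); /* 1 */
        printf("%d\n", should_escape_non_ascii("LATIN1", "UTF8"));    /* 0 */
        return 0;
    }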

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent (decoding is sketched after these notes).  If
any escaped characters cannot be represented in the database encoding,
an error will be raised.  Note that:

 * If the database encoding is UTF-8, conversion will never fail.
 * If the database encoding is SQL_ASCII, conversion will fail if any
escapes of non-ASCII characters are present.
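
For concreteness, here is a rough standalone sketch (again, not the
patch's code) of decoding one escape back into a code point, including
combining a UTF-16 surrogate pair such as \uD834\uDD1E.  Whether the
resulting character is representable is then a separate check against
the database encoding:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read 4 hex digits. */
    static unsigned int
    read_hex4(const char *p)
    {
        char buf[5];

        memcpy(buf, p, 4);
        buf[4] = '\0';
        return (unsigned int) strtoul(buf, NULL, 16);
    }

    /*
     * Decode the escape whose hex digits start at *pp, advancing *pp.
     * Assumes the string was already validated as JSON; real code would
     * also reject lone surrogates.
     */
    static unsigned int
    decode_unicode_escape(const char **pp)
    {
        unsigned int cp = read_hex4(*pp);

        *pp += 4;
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            (*pp)[0] == '\\' && (*pp)[1] == 'u')
        {
            unsigned int lo = read_hex4(*pp + 2);

            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                *pp += 6;
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            }
        }
        return cp;
    }

    int main(void)
    {
        const char *s = "D834\\uDD1E";      /* hex digits of \uD834\uDD1E */
        unsigned int cp = decode_unicode_escape(&s);

        printf("U+%04X\n", cp);             /* prints U+1D11E */
        return 0;
    }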

---

However, I'm having a really hard time figuring out how comparison
would work in this framework.  Here are a few options:

 1. Convert the strings to UTF-8, convert the escapes to characters,
and compare the strings.
 2. Convert the escapes to characters in the database encoding, then
compare the strings.
 3. If either string contains escapes of non-ASCII characters, do 1.
Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.

Number 3 could produce confusing results.  If character set X has
three different representations of one Unicode codepoint, then we
could have scenarios like this (simplified):

 "abc♫" != "aaa♫"

but:

 "abc\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding.  This
would have the nice property that all strings can be unescaped
server-side.  Problem is, what if a browser or other program produces,
say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement.  It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey
