On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
>
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.


We never store the converted values in the JSON object, nor do we return them from functions that return JSON. But many of the functions and operators that process the JSON have variants that return text instead of JSON, and in those cases, when the value returned is a JSON string, we do the following to it:

 * strip the outside quotes
 * de-escape the various escaped characters (i.e. everything preceded
   by a backslash in the railroad diagram for string at
   <http://www.json.org/>)


Here's an example of the difference:

   andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
     ?column?
   ----------
     "\u00a9"
   (1 row)

   andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
     ?column?
   ----------
     ©
   (1 row)

It's the process of producing the latter that is giving us a headache in non-UTF8 databases.
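For illustration, in (say) a LATIN1 database that conversion step fails as soon as the escaped code point has no equivalent in the server encoding, roughly like this (illustrative session, error text approximate):

   latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
   ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"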

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger than that. We also use this value for field name comparison. So, let us suppose that we have a LATIN1 database and a piece of JSON with a field name containing the Euro sign ("\u20ac"), a character that is not in LATIN1. Making that processable so it doesn't blow up would be mighty tricky and error prone. The non-orthogonality I suggested as a solution upthread is, by contrast, very small and easy to manage, and not terribly hard to explain - see attached.
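As a rough sketch of what I mean (illustrative, assuming a LATIN1 database): even a lookup that never asks for the problematic key has to de-escape every field name while it scans the object, so a query like

   latin1=# select '{ "\u20ac": "euro", "a": 1 }'::json -> 'a';

currently blows up in the conversion step even though the requested key is plain ASCII. With the attached patch the escape sequence in the field name is simply kept as text, it doesn't match 'a', and the lookup returns 1 as expected.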

cheers

andrew

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3adb365..592420a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10161,6 +10161,17 @@ table2-mapping
 
   <note>
     <para>
+      The text-returning variants of these functions and operators will convert Unicode escapes
+      in the JSON text to the appropriate UTF8 character when the database encoding is UTF8. In
+      other encodings the escape sequence is simply preserved as part of the text value, since we
+      can't be sure that the Unicode code point has a matching code point in the database encoding.
+      In general, it is best to avoid mixing Unicode escapes in JSON with a non-UTF8 database
+      encoding, if possible.
+    </para>
+  </note>
+
+  <note>
+    <para>
       The <xref linkend="hstore"> extension has a cast from <type>hstore</type> to
       <type>json</type>, so that converted <type>hstore</type> values are represented as JSON objects,
       not as string values.
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index d8046c5..bb8aa4f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -717,7 +717,6 @@ json_lex_string(JsonLexContext *lex)
 				{
 					char		utf8str[5];
 					int			utf8len;
-					char	   *converted;
 
 					if (ch >= 0xd800 && ch <= 0xdbff)
 					{
@@ -749,13 +748,31 @@ json_lex_string(JsonLexContext *lex)
 								 errdetail("low order surrogate must follow a high order surrogate."),
 								 report_json_context(lex)));
 
-					unicode_to_utf8(ch, (unsigned char *) utf8str);
-					utf8len = pg_utf_mblen((unsigned char *) utf8str);
-					utf8str[utf8len] = '\0';
-					converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
-					appendStringInfoString(lex->strval, converted);
-					if (converted != utf8str)
-						pfree(converted);
+					/*
+					 * For UTF8, replace the escape sequence by the actual utf8
+					 * character in lex->strval. For other encodings, just pass
+					 * the escape sequence through, since the chances are very
+					 * high that the database encoding won't have a matching
+					 * codepoint - that's one of the possible reasons that the
+					 * user used unicode escapes in the first place.
+					 */
+
+					if (GetDatabaseEncoding() == PG_UTF8)
+					{
+						unicode_to_utf8(ch, (unsigned char *) utf8str);
+						utf8len = pg_utf_mblen((unsigned char *) utf8str);
+						appendBinaryStringInfo(lex->strval, utf8str, utf8len);
+					}
+					else if (ch >= 0x10000)
+					{
+						/* must have been a surrogate pair */
+						appendBinaryStringInfo(lex->strval, s-12, 12);
+					}
+					else
+					{
+						/* simple escape - a single \uxxxx */
+						appendBinaryStringInfo(lex->strval, s-6, 6);
+					}
 
 				}
 			}
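
For completeness, here is roughly what the text-returning operator would do in a non-UTF8 database with the patch applied (illustrative session) - the escape sequence is preserved in the output instead of raising a conversion error:

   latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
     ?column?
   ----------
     \u20ac
   (1 row)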