On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
>
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.


We never store the converted values in the JSON object, nor do we return them from functions that return JSON. But many of the functions and operators that process the JSON have variants that return text instead of JSON, and in those cases, when the value returned is a JSON string, we do the following to it:

 * strip the outside quotes
 * de-escape the various escaped characters (i.e. everything preceded
   by a backslash in the railroad diagram for string at
   <http://www.json.org/>)


Here's an example of the difference:

   andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
     ?column?
   ----------
     "\u00a9"
   (1 row)

   andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
     ?column?
   ----------
     ©
   (1 row)

It's the process of producing the latter that is giving us a headache in non-UTF8 databases.
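For illustration, in (say) a LATIN1 database that conversion step fails as soon as the escaped code point has no equivalent in the server encoding, roughly like this (illustrative session, error text approximate):

   latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
   ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"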

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger than that. We also use this value for field name comparison. So, let us suppose that we have a LATIN1 database and a piece of JSON with a field name containing the Euro sign ("\u20ac"), a character that is not in LATIN1. Making that processable so it doesn't blow up would be mighty tricky and error prone. The non-orthogonality I suggested as a solution upthread is, by contrast, very small and easy to manage, and not terribly hard to explain - see attached.
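As a rough sketch of what I mean (illustrative, assuming a LATIN1 database): even a lookup that never asks for the problematic key has to de-escape every field name while it scans the object, so a query like

   latin1=# select '{ "\u20ac": "euro", "a": 1 }'::json -> 'a';

currently blows up in the conversion step even though the requested key is plain ASCII. With the attached patch the escape sequence in the field name is simply kept as text, it doesn't match 'a', and the lookup returns 1 as expected.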

cheers

andrew

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3adb365..592420a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10161,6 +10161,17 @@ table2-mapping
 
   <note>
     <para>
+      The text-returning variants of these functions and operators will convert Unicode escapes
+      in the JSON text to the appropriate UTF8 character when the database encoding is UTF8. In
+      other encodings the escape sequence is simply preserved as part of the text value, since we
+      can't be sure that the Unicode code point has a matching code point in the database encoding.
+      In general, it is best to avoid mixing Unicode escapes in JSON with a non-UTF8 database
+      encoding, if possible.
+    </para>
+  </note>
+
+  <note>
+    <para>
       The <xref linkend="hstore"> extension has a cast from <type>hstore</type> to
       <type>json</type>, so that converted <type>hstore</type> values are represented as JSON objects,
       not as string values.
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index d8046c5..bb8aa4f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -717,7 +717,6 @@ json_lex_string(JsonLexContext *lex)
 				{
 					char		utf8str[5];
 					int			utf8len;
-					char	   *converted;
 
 					if (ch >= 0xd800 && ch <= 0xdbff)
 					{
@@ -749,13 +748,31 @@ json_lex_string(JsonLexContext *lex)
 								 errdetail("low order surrogate must follow a high order surrogate."),
 								 report_json_context(lex)));
 
-					unicode_to_utf8(ch, (unsigned char *) utf8str);
-					utf8len = pg_utf_mblen((unsigned char *) utf8str);
-					utf8str[utf8len] = '\0';
-					converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
-					appendStringInfoString(lex->strval, converted);
-					if (converted != utf8str)
-						pfree(converted);
+					/*
+					 * For UTF8, replace the escape sequence by the actual utf8
+					 * character in lex->strval. For other encodings, just pass
+					 * the escape sequence through, since the chances are very
+					 * high that the database encoding won't have a matching
+					 * codepoint - that's one of the possible reasons that the
+					 * user used unicode escapes in the first place.
+					 */
+
+					if (GetDatabaseEncoding() == PG_UTF8)
+					{
+						unicode_to_utf8(ch, (unsigned char *) utf8str);
+						utf8len = pg_utf_mblen((unsigned char *) utf8str);
+						appendBinaryStringInfo(lex->strval, utf8str, utf8len);
+					}
+					else if (ch >= 0x10000)
+					{
+						/* must have been a surrogate pair */
+						appendBinaryStringInfo(lex->strval, s-12, 12);
+					}
+					else
+					{
+						/* simple escape - a single \uxxxx */
+						appendBinaryStringInfo(lex->strval, s-6, 6);
+					}
 
 				}
 			}
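
For completeness, here is roughly what the text-returning operator would do in a non-UTF8 database with the patch applied (illustrative session) - the escape sequence is preserved in the output instead of raising a conversion error:

   latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
     ?column?
   ----------
     \u20ac
   (1 row)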