On 01/20/2012 11:58 AM, Robert Haas wrote:
On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan <and...@dunslane.net> wrote:
XML's &#nnnn; escape mechanism is more or less the equivalent of JSON's
\unnnn. But XML documents can be encoded in a variety of encodings,
including non-Unicode encodings such as Latin-1. However, no matter what the
document encoding, &#nnnn; designates the character with Unicode code point
nnnn, whether or not that is part of the document encoding's charset.
OK.
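
To make the parallel concrete, here is a minimal Python sketch (not part of
the patch under discussion; the sample strings are illustrative assumptions).
It parses a Latin-1 encoded XML document whose character reference names
U+20AC, a code point Latin-1 cannot encode, and decodes the equivalent JSON
escape:

```python
import json
import xml.etree.ElementTree as ET

# A document declared and encoded as Latin-1. The byte 0xE9 is e-acute in
# Latin-1; the euro sign (U+20AC) has no Latin-1 byte at all, so it can only
# appear as a numeric character reference.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<v>caf\xe9 &#8364;</v>"
).encode("latin-1")
xml_text = ET.fromstring(xml_bytes).text

# The same text written as a JSON string literal with \unnnn escapes.
json_text = json.loads(r'"caf\u00e9 \u20ac"')

# Both escape forms name Unicode code points, whatever the byte encoding.
print(xml_text == json_text)
print([hex(ord(ch)) for ch in xml_text])
```

Both parsers should arrive at the same sequence of code points even though
the XML bytes on the wire were never Unicode-encoded.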

Given that precedent, I'm wondering if we do need to enforce anything other
than that it is a valid Unicode code point.

Equivalence comparison is going to be difficult anyway if you're not
resolving all \unnnn escapes. Possibly we need some sort of canonicalization
function to apply for comparison purposes. But we're not providing any
comparison ops today anyway, so I don't think we need to make that decision
now. As you say, there doesn't seem to be any defined canonical form - the
spec is a bit light on in this respect.
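
As a rough illustration of what such a canonicalization function might look
like, here is a Python sketch. The name canonicalize_json_string and the
choice of canonical form (escapes resolved to literal characters wherever
JSON allows) are assumptions made for this example only; as noted above, the
spec does not define a canonical form.

```python
import json


def canonicalize_json_string(literal: str) -> str:
    """Hypothetical helper: resolve JSON Unicode escapes, then re-serialize."""
    decoded = json.loads(literal)  # resolves escapes, including surrogate pairs
    # ensure_ascii=False writes non-ASCII characters literally instead of
    # re-escaping them, so both spellings end up in one textual form.
    return json.dumps(decoded, ensure_ascii=False)


escaped = r'"caf\u00e9"'  # e-acute written as an escape
literal = '"café"'        # e-acute written directly

print(escaped == literal)
print(canonicalize_json_string(escaped) == canonicalize_json_string(literal))
```

The two literals differ as raw text but should compare equal once both have
passed through the same canonicalization step.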
Well, we clearly have to resolve all \uXXXX to do either comparison or
canonicalization.  The current patch does neither, but presumably we
want to leave the door open to such things.  If we're using UTF-8 and
comparing two strings, and we get to a position where one of them has
a character and the other has \uXXXX, it's pretty simple to do the
comparison: we just turn XXXX into a wchar_t and test for equality.
That should be trivial, unless I'm misunderstanding.  If, however,
we're not using UTF-8, we have to first turn \uXXXX into a Unicode
code point, then convert that to a character in the database encoding,
and then test for equality with the other character after that.  I'm
not sure whether that's possible in general, how to do it, or how
efficient it is.  Can you or anyone shed any light on that topic?
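
The non-UTF-8 case described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: escape_matches_char
is a hypothetical name, Python codecs stand in for the database encoding,
and surrogate pairs are ignored.

```python
def escape_matches_char(hex_digits: str, encoded_char: bytes, db_encoding: str) -> bool:
    """Compare the XXXX of a \\uXXXX escape with one character in db_encoding."""
    code_point = int(hex_digits, 16)  # \uXXXX -> Unicode code point
    try:
        # Code point -> bytes of that character in the database encoding.
        as_db_bytes = chr(code_point).encode(db_encoding)
    except UnicodeEncodeError:
        # The code point has no representation in this encoding, so it
        # cannot be equal to any character stored in it.
        return False
    return as_db_bytes == encoded_char


# e-acute exists in Latin-1 as the single byte 0xE9.
print(escape_matches_char("00e9", b"\xe9", "latin-1"))
# The euro sign does not exist in Latin-1, so no Latin-1 byte can match it.
print(escape_matches_char("20ac", b"\xa4", "latin-1"))
# In ISO-8859-15 the byte 0xA4 is the euro sign.
print(escape_matches_char("20ac", b"\xa4", "iso8859-15"))
```

The second call shows where the conversion is not possible in general: some
code points simply have no counterpart in the target encoding.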


We know perfectly well how to convert two strings from encoding x to UTF-8
(see pg_do_encoding_conversion() in mbutils.c). Once we've done that, ISTM
we have reduced this to the previous problem, as the mathematicians like
to say.
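
A small Python analogy of that reduction, assuming Python's codecs as a
stand-in for the server's conversion routine (this is not the PostgreSQL
function, and resolve_in_utf8 is a hypothetical name): convert each stored
string to UTF-8 first, then resolve escapes and compare.

```python
import json


def resolve_in_utf8(stored_literal: bytes, db_encoding: str) -> str:
    """Convert a JSON string literal to UTF-8, then resolve its escapes."""
    # Step 1: database encoding -> UTF-8.
    utf8_bytes = stored_literal.decode(db_encoding).encode("utf-8")
    # Step 2: we are now in the UTF-8 case, where every escape can be resolved.
    return json.loads(utf8_bytes)


# Two literals as they might be stored in a Latin-1 database.
direct = '"café"'.encode("latin-1")  # e-acute as the byte 0xE9
escaped = rb'"caf\u00e9"'            # e-acute as an ASCII-only escape

print(resolve_in_utf8(direct, "latin-1") == resolve_in_utf8(escaped, "latin-1"))

# An escape naming a code point outside Latin-1 is no obstacle after step 1.
print(resolve_in_utf8(rb'"\u20ac"', "latin-1") == "\u20ac")
```

Doing the conversion up front means the comparison never needs to map a code
point back into the database encoding.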


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
