On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> Second, what should be do when the database encoding isn't UTF8? I'm
> inclined to emit a \unnnn escape for any non-ASCII character (assuming it
> has a unicode code point - are there any code points in the non-unicode
> encodings that don't have unicode equivalents?). The alternative would be to
> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
> to deal with JSON should be using UTF8 anyway, but we still have to deal
> with these things. What about SQL_ASCII? If there's a non-ASCII sequence
> there we really have no way of telling what it should be. There at least I
> think we should probably error out.

I don't think there is a satisfying solution to this problem.  Things
working against us:

 * Some server encodings support characters that don't map to Unicode
characters (e.g. unused slots in Windows-1252).  Thus, converting to
UTF-8 and back is lossy in general.

 * We want a normalized representation for comparison.  This will
involve a mixture of server and Unicode characters, unless the
encoding is UTF-8.

 * We can't efficiently convert individual characters to and from
Unicode with the current API.

 * What do we do about \u0000 ?  TEXT datums cannot contain NUL characters.

I'd say just ban Unicode escapes and non-ASCII characters unless the
server encoding is UTF-8, and ban all \u0000 escapes.  It's easy, and
whatever we support later will be a superset of this.

Strategies for handling this situation have been discussed in prior
emails.  This is where things got stuck last time.

- Joey

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to