On Nov 24, 2011, at 10:54, Florian Weimer wrote:
>> Or is it not only about being able to *store* NULs in a text field?
> 
> No, the entire core should be NUL-transparent.

That's unlikely to happen. A more realistic approach would be to solve
this only for UTF-8 encoded strings by encoding the NUL character not as
a single 0 byte, but as a sequence of non-0 bytes.

Such a thing is possible in UTF-8 because there are multiple ways to
encode the same character once you drop the requirement that characters
be encoded in the *shortest* possible way.
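
To illustrate: a two-byte UTF-8 sequence has the form 110xxxxx 10xxxxxx
and carries 11 payload bits, so every code point below 0x80 fits into it
as well. Here is a rough Python sketch; the helper names overlong_2byte
and naive_decode_2byte are made up for this mail and belong to no library.

```python
def overlong_2byte(cp: int) -> bytes:
    """Encode a code point below 0x80 as a non-shortest two-byte sequence."""
    if not 0 <= cp < 0x80:
        raise ValueError("this sketch only handles code points below 0x80")
    # Lead byte 110xxxxx carries the upper 5 bits, continuation byte
    # 10xxxxxx carries the lower 6 bits.
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])


def naive_decode_2byte(seq: bytes) -> int:
    """Decode a two-byte sequence WITHOUT checking for shortest form."""
    lead, cont = seq
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_decoder_accepts(seq: bytes) -> bool:
    """Report whether Python's standard-conforming UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A decoder that skips the shortest-form check maps 0xC0 0x80 back to code
point 0, while a conforming decoder such as Python's rejects the sequence.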

Since we very probably won't loosen up UTF-8's integrity checks to allow
that, it'd have to be done as a new encoding, say 'utf8-loose'.

That new encoding could, for example, use 0xC0 0x80 to represent NUL
characters. This byte sequence is invalid in standard-conforming UTF-8
because it's a non-normalized (i.e. overly long) representation of a code
point (the code point NUL, incidentally). A bit of googling suggests that
quite a few pieces of software use this kind of modified UTF-8 encoding.

Java, for example, seems to use it to serialize Strings (which may contain
NUL characters) to UTF-8.
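
As a rough sketch of what such a NUL-free encoding would do, here is a
Python version. The names encode_loose and decode_loose are hypothetical,
and this only illustrates the byte-level idea, not how a server-side
encoding would actually be implemented.

```python
NUL_MARKER = b"\xc0\x80"  # overlong two-byte form of U+0000


def encode_loose(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as 0xC0 0x80."""
    # In standard UTF-8 a 0x00 byte can only stem from U+0000 itself,
    # so replacing it after encoding cannot damage other characters.
    return text.encode("utf-8").replace(b"\x00", NUL_MARKER)


def decode_loose(data: bytes) -> str:
    """Decode bytes produced by encode_loose back into text."""
    if b"\x00" in data:
        raise ValueError("raw 0x00 byte is not allowed in this encoding")
    # 0xC0 never occurs in valid UTF-8, so the marker is unambiguous.
    return data.replace(NUL_MARKER, b"\x00").decode("utf-8")
```

The encoded form contains no 0 byte, so it survives NUL-terminated C
string handling, and decoding restores the original text.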

Should you try to add a new encoding which supports that, you might also
want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
that code points representable by UTF-16 surrogate pairs may be encoded by
separately encoding the two surrogate characters in UTF-8.
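
To make that concrete, here is a small Python sketch of CESU-8-style
encoding for a single code point outside the BMP. The function names are
made up for this mail; the bit layout follows the usual UTF-16 surrogate
and three-byte UTF-8 rules.

```python
def encode_3byte(unit: int) -> bytes:
    """Encode a 16-bit value in the three-byte form 1110xxxx 10xxxxxx 10xxxxxx."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_encode_supplementary(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as two 3-byte surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
    return encode_3byte(high) + encode_3byte(low)
```

For U+1F600 this yields the six bytes ED A0 BD ED B8 80, whereas standard
UTF-8 encodes the same code point in four bytes as F0 9F 98 80.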

best regards,
Florian Pflug

