[HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

Álvaro Hernández Tortosa Wed, 03 Aug 2016 08:01:39 -0700


    Hi list.

As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance), varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be hitting some use cases, like:

- People migrating data from other databases (apart from PostgreSQL, I don't know of any other database which suffers the same problem).
- People using drivers which use UTF-8 or equivalent encodings by default (Java for example)

Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. And given the general Postgres policy regarding standards compliance and the people being hit by this, I think it should be addressed. Especially since all the usual fixes are a real PITA (re-parsing, re-generating strings, which is very expensive, or dropping data).

What would it take to support it? Isn't the varlena header propagated everywhere, which could help infer the real length of the string? Any pointers or suggestions would be welcome.
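To make the question concrete, here is a minimal sketch in C of why a length header alone may not be enough. It is not PostgreSQL code: LenString, lenstring_make and cstring_visible_len are hypothetical names made up for illustration, and the struct is only a stand-in for a length-prefixed value, not the actual varlena layout. The point is that the header keeps the full length, while any code path that treats the same bytes as a NUL-terminated C string stops at the first 0x00.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed string. This is NOT PostgreSQL's varlena
 * layout, only a stand-in for "header carries the real length". */
typedef struct {
    size_t len;    /* number of payload bytes, embedded 0x00 included */
    char   data[]; /* payload, not NUL-terminated */
} LenString;

/* Copy len bytes into a newly allocated LenString (caller frees it).
 * Returns NULL if allocation fails. */
static LenString *lenstring_make(const char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);

    LenString *s = malloc(sizeof(LenString) + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    if (len > 0)
        memcpy(s->data, bytes, len);
    return s;
}

/* Length that a NUL-terminated-string routine would report for the same
 * payload: everything up to, but not including, the first 0x00 byte. */
static size_t cstring_visible_len(const LenString *s)
{
    const char *nul = memchr(s->data, '\0', s->len);
    return nul != NULL ? (size_t) (nul - s->data) : s->len;
}
```

For the five bytes 'a', 'b', 0x00, 'c', 'd', the header reports 5 while cstring_visible_len reports 2. So under this assumption, every place that hands the value to C-string style code would have to be audited, not just the places that read the header.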


    Thanks,

    Álvaro


--

Álvaro Hernández Tortosa


-----------
8Kdata



--
Sent via pgsql-hackers mailing list
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
