Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

Álvaro Hernández Tortosa Wed, 03 Aug 2016 12:14:14 -0700


On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:

On 03/08/16 17:47, Kevin Grittner wrote:
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <a...@8kdata.com> wrote:
     What would it take to support it?
Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
    That's nice, but I don't think so.
The problem is that you cannot predict how people will send you data, for example when importing from other databases. I guess it may work if Postgres implemented such a UTF-8 variant and the drivers did too, but that would still require an encoding conversion (i.e., parsing every string) to change the 0x00, which seems like a serious performance hit.
    It could be worse than nothing, though!

    Thanks,

    Álvaro


    It may indeed work.

According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 codepoint. In short, the \u0000 NUL is represented (overlong encoding) by the two-byte, one-character sequence \uc080. These two bytes are invalid UTF-8, so they should not appear in an otherwise valid UTF-8 string. Yet they are accepted by Postgres (as if Postgres supported Modified UTF-8 intentionally). The character does not render in psql as a NUL but as this symbol: "삀".
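
One thing worth double-checking here is the difference between the two raw bytes 0xC0 0x80 and the escape \uc080, which normally denotes the code point U+C080 (a Hangul syllable). The following is only a minimal sketch in Python, not a test against Postgres; the variable names are made up for illustration. It shows how a strict UTF-8 decoder treats the overlong form, and which bytes U+C080 actually encodes to:

```python
# Overlong two-byte form that Modified UTF-8 uses for U+0000.
overlong_nul = bytes([0xC0, 0x80])

# Check whether a strict UTF-8 decoder accepts the overlong form.
try:
    overlong_nul.decode("utf-8")
    strict_accepts_overlong = True
except UnicodeDecodeError:
    strict_accepts_overlong = False

# The escape \uc080 is the single code point U+C080; encode it to
# see the bytes it really occupies in UTF-8.
hangul_bytes = "\uc080".encode("utf-8")

print(strict_accepts_overlong)  # False
print(hangul_bytes.hex())       # ec8280
```

So the code point U+C080 is stored as three perfectly valid bytes, which are not the same as the two-byte overlong sequence. Whether that explains what psql displayed would need to be verified against the actual bytes stored.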


    Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they appear in the string, they are the null byte.
- Replace that byte with the two bytes 0xC0 0x80.
- Reverse the operation when reading.

This is OK, but of course it is a performance hit (searching for 0x00 and then growing the byte[] or whatever data structure to account for the extra byte). A little bit of a PITA, but I guess better than fixing it all :)
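
The steps above can be sketched roughly as follows. This is only an illustration in Python, assuming the input is otherwise valid UTF-8 (so a 0x00 byte can only be a NUL character, and 0xC0 never occurs on its own). The function names `escape_nul` and `unescape_nul` are hypothetical and are not part of any driver or of Postgres:

```python
NUL = b"\x00"
# Overlong two-byte encoding of U+0000, as used by Modified UTF-8.
MODIFIED_NUL = b"\xc0\x80"


def escape_nul(data: bytes) -> bytes:
    """Replace every 0x00 byte with 0xC0 0x80 before writing.

    Assumes `data` is valid UTF-8, where 0x00 can only be the NUL
    character and never part of a multi-byte sequence.
    """
    return data.replace(NUL, MODIFIED_NUL)


def unescape_nul(data: bytes) -> bytes:
    """Reverse the operation when reading: 0xC0 0x80 becomes 0x00.

    Safe for data produced by escape_nul, because 0xC0 never
    appears in valid UTF-8.
    """
    return data.replace(MODIFIED_NUL, NUL)


original = "foo\x00bar".encode("utf-8")
stored = escape_nul(original)

print(stored)                            # b'foo\xc0\x80bar'
print(unescape_nul(stored) == original)  # True
```

Both directions scan the whole string and may allocate a new buffer, which is the performance cost mentioned above.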



    Álvaro


--

Álvaro Hernández Tortosa


-----------
8Kdata



