Re: Almost bug in COPY FROM processing of GB18030 encoded input

Heikki Linnakangas Fri, 25 Jan 2019 04:57:13 -0800

On 24/01/2019 23:27, Robert Haas wrote:
> On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <[email protected]> wrote:
>> I happened to notice that when CopyReadLineText() calls mblen(), it
>> passes only the first byte of the multi-byte characters. However,
>> pg_gb18030_mblen() looks at the first and the second byte.
>> CopyReadLineText() always passes \0 as the second byte, so
>> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
>> characters as 2.
>>
>> It works out fine, though, because the second half of the 4-byte encoded
>> character always looks like another 2-byte encoded character, in
>> GB18030. CopyReadLineText() is looking for delimiter and escape
>> characters and newlines, and only single-byte characters are supported
>> for those, so treating a 4-byte character as two 2-byte characters is
>> harmless.
>
> Yikes.


Committed the comment changes, so it's less of a gotcha now.

- Heikki
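
For readers following along, the behaviour described in this thread can be reproduced with a small standalone sketch. It is not the PostgreSQL source: gb18030_len_sketch() and scan_sketch() are hypothetical names, and the length rule is a simplified stand-in that assumes a lead byte with the high bit set starts a 4-byte character when the second byte is in 0x30..0x39, and a 2-byte character otherwise. The first_byte_only flag mimics a caller that always supplies \0 as the second byte.

```c
#include <stddef.h>

/*
 * Simplified stand-in for a GB18030 length lookup (assumption, not the
 * PostgreSQL implementation): ASCII is 1 byte, a high-bit lead byte
 * followed by 0x30..0x39 starts a 4-byte character, anything else is
 * treated as a 2-byte character.
 */
static int
gb18030_len_sketch(unsigned char b1, unsigned char b2)
{
    if (b1 < 0x80)
        return 1;
    if (b2 >= 0x30 && b2 <= 0x39)
        return 4;
    return 2;
}

/*
 * Walk the buffer one "character" at a time and return the final offset.
 * With first_byte_only set, the second byte is always passed as \0, so
 * every 4-byte character is stepped over as two 2-byte characters.
 * *steps receives the number of characters the scanner believed it saw.
 */
static size_t
scan_sketch(const unsigned char *buf, size_t n, int first_byte_only,
            size_t *steps)
{
    size_t pos = 0;

    *steps = 0;
    while (pos < n)
    {
        unsigned char b2 = 0;

        if (!first_byte_only && pos + 1 < n)
            b2 = buf[pos + 1];
        pos += (size_t) gb18030_len_sketch(buf[pos], b2);
        (*steps)++;
    }
    return pos;
}
```

Given the 9-byte input 'a', 0x81 0x30 0x81 0x30, ',', 0xC4 0xE3, '\n', the scan with full lookahead takes 5 steps and the first-byte-only scan takes 6, but both end at offset 9 and both land exactly on the single-byte ',' and '\n'. That is the sense in which the miscount is harmless for a scanner that only looks for single-byte delimiters: the ASCII-range bytes inside the 4-byte character (0x30 here) are always skipped as the trailing byte of a presumed 2-byte character.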
