Re: [HACKERS] chr() is still too loose about UTF8 code points

Heikki Linnakangas Fri, 16 May 2014 09:45:13 -0700

On 05/16/2014 06:05 PM, Tom Lane wrote:

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.
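
To make the mismatch concrete, here is a minimal sketch in Python rather than the backend's C. It packs a code point into the 4-byte UTF8 bit layout and then asks a strict RFC3629 decoder whether the result is legal. The helper names encode_4byte and is_rfc3629_legal are hypothetical, and Python's built-in decoder is only a stand-in for pg_utf8_islegal; it is not the PostgreSQL code.

```python
def encode_4byte(cp: int) -> bytes:
    """Pack a code point into the 4-byte UTF-8 layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

    Only the 21-bit capacity of the layout is checked here, not the
    RFC 3629 upper bound, which mirrors the loose behaviour under discussion.
    """
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not use the 4-byte form")
    return bytes([
        0xF0 | (cp >> 18),            # top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # next 6 bits
        0x80 | (cp & 0x3F),           # low 6 bits
    ])


def is_rfc3629_legal(data: bytes) -> bool:
    """Stand-in validity check: Python's strict decoder rejects code points above U+10FFFF."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (0x10FFFF, 0x1FFFFF):
        raw = encode_4byte(cp)
        print(f"U+{cp:X} -> {raw.hex()} legal={is_rfc3629_legal(raw)}")
```

U+1FFFFF packs to f7 bf bf bf, which fits in four bytes but fails the strict check, while U+10FFFF packs to f4 8f bf bf and passes.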


This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10ffff.  Should we back-patch that, or just do it in HEAD?
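
The proposed tightening amounts to one extra range check before the character is built. A rough sketch of that check, written in Python for illustration only: the function name checked_chr, the constant name, and the error wording are assumptions and are not taken from the backend source.

```python
MAX_RFC3629_CODE_POINT = 0x10FFFF  # upper bound allowed by RFC 3629


def checked_chr(cp: int) -> str:
    """Return the character for cp, refusing anything a strict UTF-8 validator would reject as too large."""
    if cp <= 0:
        # Hypothetical wording; a zero or negative code point is not a usable character here.
        raise ValueError(f"code point must be positive: {cp}")
    if cp > MAX_RFC3629_CODE_POINT:
        # Hypothetical wording for the new rejection case.
        raise ValueError(f"requested character too large for encoding: {cp}")
    return chr(cp)


if __name__ == "__main__":
    print(repr(checked_chr(0x41)))
    try:
        checked_chr(0x1FFFFF)
    except ValueError as exc:
        print("rejected:", exc)
```

With a check like this, the value from the session above is refused at creation time instead of failing later on reload. The sketch only covers the upper bound discussed here; other restrictions, such as surrogate code points, are left out.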

+1 for back-patching. A value that cannot be restored is bad, and I
can't imagine any legitimate use case for producing a Unicode character
larger than U+10FFFF with chr(x), when the rest of the system doesn't
handle it. Fully supporting such values might be useful, but that's a
different story.


- Heikki


