On Wed, Oct 12, 2011 at 11:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@oss.ntt.co.jp> wrote:
> Hello, the work is finished.
>
> Version 4 of the patch is attached to this message.

I went through this in a bit more detail tonight and am cleaning it
up. But I'm a bit confused, looking at pg_utf8_increment() in detail:

- Why does the second byte need special handling for 0xED and 0xF4?
AFAICT, UTF-8 requires all legal strings to have a second byte between
0x80 and 0xBF, just as in byte positions 3 and 4, so these bytes would
be invalid in this position anyway.

- In the first byte, we don't increment if the current value for that
byte is 0x7F, 0xDF, 0xEF, or 0xF4. But why isn't it 0xF7 rather than
0xF4? I see there's a comparable restriction in pg_utf8_islegal(), but
I don't understand why.

- Perhaps on the same point, the comments claim that we will fail for
code points U+0007F, U+007FF, U+0FFFF, and U+10FFFF. But IIUC, a
4-byte unicode character can encode values up to U+1FFFFF, so why is
it U+10FFFF rather than U+1FFFFF?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
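
For reference on the three questions in the message above, here is a minimal sketch in C of the byte arithmetic involved. It assumes the patch follows RFC 3629, which caps UTF-8 at U+10FFFF and excludes the surrogate range U+D800..U+DFFF; whether that is the author's actual reasoning is not confirmed by this message. The helper names (max_cp_for_lead4, decode3, decode4, second_byte_ok, self_check) are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value the bit layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx can
 * hold for a given lead byte, ignoring any Unicode limits. */
static uint32_t max_cp_for_lead4(uint8_t first)
{
    return ((uint32_t)(first & 0x07) << 18) | 0x3FFFF;
}

/* Raw payload of a 3-byte sequence (no validation). */
static uint32_t decode3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12) |
           ((uint32_t)(b1 & 0x3F) << 6) |
           (uint32_t)(b2 & 0x3F);
}

/* Raw payload of a 4-byte sequence (no validation). */
static uint32_t decode4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return ((uint32_t)(b0 & 0x07) << 18) |
           ((uint32_t)(b1 & 0x3F) << 12) |
           ((uint32_t)(b2 & 0x3F) << 6) |
           (uint32_t)(b3 & 0x3F);
}

/* Second-byte range per RFC 3629. Assumes `first` is already known to be
 * a valid multi-byte lead byte (0xC2..0xF4). */
static bool second_byte_ok(uint8_t first, uint8_t second)
{
    uint8_t lo = 0x80, hi = 0xBF;

    switch (first)
    {
        case 0xE0: lo = 0xA0; break;  /* below this: overlong 3-byte form */
        case 0xED: hi = 0x9F; break;  /* above this: surrogates U+D800..U+DFFF */
        case 0xF0: lo = 0x90; break;  /* below this: overlong 4-byte form */
        case 0xF4: hi = 0x8F; break;  /* above this: beyond U+10FFFF */
        default:   break;
    }
    return second >= lo && second <= hi;
}

/* Worked checks for the three questions. */
static void self_check(void)
{
    /* The bit layout alone reaches U+1FFFFF with lead byte 0xF7 ... */
    assert(max_cp_for_lead4(0xF7) == 0x1FFFFF);
    /* ... but U+10FFFF is F4 8F BF BF, so 0xF4 is the last usable lead byte. */
    assert(decode4(0xF4, 0x8F, 0xBF, 0xBF) == 0x10FFFF);
    assert(decode4(0xF4, 0x90, 0x80, 0x80) == 0x110000);
    assert(!second_byte_ok(0xF4, 0x90));
    /* ED A0 80 is in 0x80..0xBF but decodes to the first surrogate. */
    assert(decode3(0xED, 0x9F, 0xBF) == 0xD7FF);
    assert(decode3(0xED, 0xA0, 0x80) == 0xD800);
    assert(second_byte_ok(0xED, 0x9F));
    assert(!second_byte_ok(0xED, 0xA0));
}
```

Calling self_check() runs the worked cases. Under the RFC 3629 assumption, a second byte in 0x80..0xBF is necessary but not sufficient after 0xED or 0xF4, which would be one reason for an increment routine to treat those two lead bytes specially.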