On Sat, 7 Aug 2004, Tom Lane wrote:

> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about UTF-8. This is how it is encoded:

character            encoding
-------------------  ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of leading ones gives the 
length of the UTF-8 sequence. The rest of the bytes in the sequence 
always start with 10 (this makes it possible to look anywhere in the 
string and quickly find the start of a character).
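The length rule above can be sketched in C roughly like this (the function
name is illustrative, not the actual pg_utf_mblen; it just shows the
bit patterns from the table):

```c
#include <assert.h>

/* Sketch: the number of leading one bits in the first byte gives the
 * sequence length; a leading 0 bit means a 1-byte ASCII character.
 * Returns -1 for bytes that cannot start a sequence. */
int utf8_seq_len(unsigned char first)
{
    if ((first & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((first & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((first & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((first & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((first & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((first & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return -1;  /* 10xxxxxx continuation byte, or 0xFE/0xFF */
}
```

Note that a continuation byte (10xxxxxx) falls through every test, which
is exactly why you can scan backwards from any position to find the start
of the character.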

This also means that the start byte can never start with 7 or 8 ones; that 
is illegal and should be tested for and rejected. So the longest UTF-8 
sequence is 6 bytes (and the longest Unicode character needs 4 bytes, 
while the scheme as a whole can encode 31 bits).
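A minimal sketch of that rejection test, assuming the caller already knows
the expected sequence length (the function name is made up for
illustration): reject 0xFE/0xFF as start bytes, and require every
following byte to match 10xxxxxx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch: reject start bytes with 7 or 8 leading ones (0xFE, 0xFF),
 * and require each trailing byte of the sequence to be a 10xxxxxx
 * continuation byte. */
bool utf8_seq_valid(const unsigned char *s, size_t len)
{
    if (len == 0 || s[0] == 0xFE || s[0] == 0xFF)
        return false;
    for (size_t i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)  /* not a continuation byte */
            return false;
    return true;
}
```

A real validator would also check that the length implied by the start
byte matches len and that the value is not an over-long encoding, but
those checks are beyond what the text above describes.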

/Dennis Björklund
