UTF8 encoding is rather simple, really:

byte number:
b1           b2           b3           b4
  0 -- 127                                          = unicode 0x00 - 0x7F
192 -- 223   128 -- 191                             = unicode 0x80 - 0x7FF
224 -- 239   128 -- 191   128 -- 191                = unicode 0x800 - 0xFFFF
240 -- 247   128 -- 191   128 -- 191   128 -- 191   = unicode 0x10000 - 0x1FFFFF

(Unicode itself stops at 0x10FFFF, so the upper part of the last range is never
used.)
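
Going the other way (from a unicode number to bytes) follows directly from the
table. Here is a minimal Python sketch; the name utf8_encode is my own, and I
am assuming the current Unicode upper limit of 0x10FFFF. It does not reject
the surrogate range 0xD800 - 0xDFFF, which a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one unicode number as UTF8 bytes, following the table above."""
    if code_point < 0:
        raise ValueError("negative unicode number")
    if code_point <= 0x7F:
        # One byte: 0 -- 127
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 192 -- 223, then one continuation byte
        return bytes([192 + (code_point >> 6),
                      128 + (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Three bytes: 224 -- 239, then two continuation bytes
        return bytes([224 + (code_point >> 12),
                      128 + ((code_point >> 6) & 0x3F),
                      128 + (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # Four bytes: 240 -- 247, then three continuation bytes
        return bytes([240 + (code_point >> 18),
                      128 + ((code_point >> 12) & 0x3F),
                      128 + ((code_point >> 6) & 0x3F),
                      128 + (code_point & 0x3F)])
    raise ValueError("beyond the Unicode range (assumed limit 0x10FFFF)")
```

Each continuation byte carries 6 bits of the number, which is why they all fall
in 128 -- 191.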

There are also sequences for 5 and 6 bytes, but these are illegal for Unicode
representations at the moment:

248 -- 251   128 -- 191    128 -- 191    128 -- 191    128 -- 191   
252 -- 253   128 -- 191    128 -- 191    128 -- 191    128 -- 191    128 -- 191   
 
128 -- 191 are illegal as first bytes in UTF8 (that is handy for error-recovery).

254 and 255 are completely illegal and should not appear at all (if you see them,
it's a safe bet that the document is encoded as UTF16, not UTF8: the UTF16 byte
order mark consists of exactly these two bytes).
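
To show why this is handy for error recovery, here is a rough Python sketch.
The names sequence_length and resync are made up for this example. After an
error, a decoder can simply skip bytes until it finds one that is allowed to
start a sequence.

```python
def sequence_length(byte: int) -> int:
    """Length announced by a first byte, or 0 if it cannot start a sequence."""
    if byte <= 127:
        return 1
    if 192 <= byte <= 223:
        return 2
    if 224 <= byte <= 239:
        return 3
    if 240 <= byte <= 247:
        return 4
    # 128 -- 191 (continuation bytes), 248 -- 253 (5 and 6 byte forms),
    # 254 and 255 are all rejected here.
    return 0


def resync(data: bytes, pos: int) -> int:
    """Skip forward from pos to the next byte that can start a sequence."""
    while pos < len(data) and sequence_length(data[pos]) == 0:
        pos += 1
    return pos
```

For instance, resync(b"\xa9\xa9A", 0) skips the two stray continuation bytes
and returns the position of the "A".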


The unicode number for a UTF8 sequence can be calculated as:

byte1                                                                   if byte1 <= 127
(byte1-192)*64 + (byte2-128)                                            if 192 <= byte1 <= 223
(byte1-224)*4096 + (byte2-128)*64 + (byte3-128)                         if 224 <= byte1 <= 239
(byte1-240)*262144 + (byte2-128)*4096 + (byte3-128)*64 + (byte4-128)    if 240 <= byte1 <= 247
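
As a sanity check, the calculation can be sketched in Python like this. The
function name utf8_decode_one is my own invention. It multiplies by 64 once per
continuation byte, which works out to the same factors 64, 4096 and 262144 as
in the formulas. It is only a sketch: it does not reject overlong forms or
numbers beyond the Unicode range.

```python
def utf8_decode_one(data: bytes, pos: int = 0):
    """Return (unicode number, sequence length) for the sequence at pos."""
    byte1 = data[pos]
    if byte1 <= 127:
        return byte1, 1
    if 192 <= byte1 <= 223:
        length, value = 2, byte1 - 192
    elif 224 <= byte1 <= 239:
        length, value = 3, byte1 - 224
    elif 240 <= byte1 <= 247:
        length, value = 4, byte1 - 240
    else:
        raise ValueError(f"byte {byte1} cannot start a sequence")

    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if not 128 <= byte <= 191:
            raise ValueError(f"byte {byte} is not a continuation byte")
        # Shift what we have by 6 bits and add the next 6 bits.
        value = value * 64 + (byte - 128)
    return value, length
```

For the two bytes 195 169 this gives (195-192)*64 + (169-128) = 233, which is
unicode 0xE9.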

Simple, eh?

-- 
Greetings,

Taco