UTF8 encoding is rather simple, really:
byte number:
b1          b2          b3          b4
  0 -- 127                                      = unicode 0x00 - 0x7F
192 -- 223  128 -- 191                          = unicode 0x80 - 0x7FF
224 -- 239  128 -- 191  128 -- 191              = unicode 0x800 - 0xFFFF
240 -- 247  128 -- 191  128 -- 191  128 -- 191  = unicode 0x10000 - 0x1FFFFF
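The table can be checked against an existing encoder. A small sketch, assuming
Python 3 and its built-in str.encode; the helper name byte_values is made up
for illustration:

```python
def byte_values(code_point):
    """Return the UTF-8 bytes of one code point as a list of integers."""
    return list(chr(code_point).encode("utf-8"))


# First and last code point of each row. Python only encodes up to
# 0x10FFFF, the current Unicode limit, so the last row stops there.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(hex(cp), byte_values(cp))
```

Note that 0x80 comes out as 194 128, not 192 128: first bytes 192 and 193
could only carry values below 0x80, which already fit in one byte.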
There are also sequences for 5 and 6 bytes, but these are illegal for Unicode
representations at the moment:
248 -- 251  128 -- 191  128 -- 191  128 -- 191  128 -- 191
252 -- 253  128 -- 191  128 -- 191  128 -- 191  128 -- 191  128 -- 191
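128 -- 191 are illegal as first bytes in UTF8 (that is handy for error-recovery:
after a damaged sequence you can skip ahead to the next byte outside this range).
254 and 255 are completely illegal and should not appear at all (if you see them,
it's a safe bet that the document is encoded as UTF16, not UTF8, since those two
bytes make up the UTF16 byte order mark).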
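The error-recovery remark can be sketched in a few lines: classify a byte by
the ranges above, and after a bad byte skip ahead until one that may start a
character. This is only an illustration, assuming Python 3; sequence_length and
resync are made-up names, and the 5 and 6 byte first bytes are treated as
illegal here:

```python
def sequence_length(first_byte):
    """Return the length of the sequence that starts with first_byte.

    Returns 0 for a byte that cannot start a sequence.
    """
    if first_byte <= 127:
        return 1
    if 192 <= first_byte <= 223:
        return 2
    if 224 <= first_byte <= 239:
        return 3
    if 240 <= first_byte <= 247:
        return 4
    # 128-191 (continuation bytes), 248-253 (5 and 6 byte forms), 254, 255
    return 0


def resync(data, pos):
    """Skip forward from pos to the next byte that can start a sequence."""
    while pos < len(data) and sequence_length(data[pos]) == 0:
        pos += 1
    return pos


# Two stray continuation bytes followed by "A": decoding can resume at index 2.
damaged = bytes([191, 128, 65])
print(resync(damaged, 0))
```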
The unicode number for a UTF8 sequence can be calculated as:
byte1                                              if byte1 <= 127
(byte1-192)*64 + (byte2-128)                       if 192 <= byte1 <= 223
(byte1-224)*4096 + (byte2-128)*64 + (byte3-128)    if 224 <= byte1 <= 239
(byte1-240)*262144 + (byte2-128)*4096 + (byte3-128)*64 + (byte4-128)
                                                   if 240 <= byte1 <= 247
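Written out as a function, the calculation might look like this. A minimal
sketch, assuming Python 3; decode_sequence is a made-up name, and there is no
validation of the continuation bytes or of the sequence length:

```python
def decode_sequence(seq):
    """Return the unicode number for one UTF-8 sequence of 1 to 4 bytes.

    The continuation bytes are assumed to lie in 128-191 and the
    sequence is assumed to have the length its first byte calls for.
    """
    b1 = seq[0]
    if b1 <= 127:
        return b1
    if 192 <= b1 <= 223:
        return (b1 - 192) * 64 + (seq[1] - 128)
    if 224 <= b1 <= 239:
        return (b1 - 224) * 4096 + (seq[1] - 128) * 64 + (seq[2] - 128)
    if 240 <= b1 <= 247:
        return ((b1 - 240) * 262144 + (seq[1] - 128) * 4096
                + (seq[2] - 128) * 64 + (seq[3] - 128))
    raise ValueError("byte %d cannot start a UTF-8 sequence" % b1)


# Compare with Python's own encoder on 1, 2, 3 and 4 byte characters.
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    assert decode_sequence(ch.encode("utf-8")) == ord(ch)
```

The multipliers 64, 4096 and 262144 are just 2^6, 2^12 and 2^18: every
continuation byte contributes six bits.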
Simple, eh?
--
Greetings,
Taco