On Thu, Jun 21, 2007 at 02:50:02PM -0700, Paul Ramsey wrote:
> You're right and I'm wrong, I was confused by the UTF code numbers,
> which differ from the actual byte encodings used for UTF8. Indeed,
> all the multi-byte higher-order stuff is stuffed into 128-255 in the
> UTF8 encoding, so a straight byte-swap would work (for UTF8 and the
> various one-byte latin code pages, that is).

Additionally, the leading and trailing bytes of multibyte UTF-8 sequences
occupy disjoint ranges, and the value of the leading byte indicates how
many trailing bytes follow. Section 2.5 of The Unicode Standard discusses
encoding form design principles; Section 3.9 contains the formal
definitions. Table 3-7 shows the byte ranges allowed in each position
(single, leading, trailing).

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

-- 
Michael Fuhr
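
To make the byte ranges concrete, here is a rough Python sketch that classifies a single byte by the role it can play in well-formed UTF-8. The function names classify_utf8_byte and describe are made up for illustration. The sketch looks at each byte on its own, so it does not enforce the tighter limits that the table places on the first trailing byte after certain leading bytes (E0, ED, F0 and F4).

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role a byte value can play in well-formed UTF-8.

    Simplified: each byte is judged in isolation, so position-dependent
    restrictions on trailing bytes are not checked here.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"      # ASCII, a complete one-byte sequence
    if b <= 0xBF:
        return "trailing"    # 0x80..0xBF, continuation byte
    if 0xC2 <= b <= 0xDF:
        return "lead-2"      # one trailing byte follows
    if 0xE0 <= b <= 0xEF:
        return "lead-3"      # two trailing bytes follow
    if 0xF0 <= b <= 0xF4:
        return "lead-4"      # three trailing bytes follow
    return "invalid"         # 0xC0, 0xC1, 0xF5..0xFF never appear


def describe(text: str) -> list:
    """Encode text as UTF-8 and label every resulting byte."""
    return [(byte, classify_utf8_byte(byte)) for byte in text.encode("utf-8")]


if __name__ == "__main__":
    for byte, role in describe("a\u00e9\u20ac"):
        print(f"0x{byte:02X} {role}")
```

Every byte labelled anything other than "single" is 0x80 or above, which matches the point in the quoted message that all multibyte content lives in 128-255.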
