> This reminds me: we should probably have some user-accessible method of
> detecting the Unicode encodings, too: UTF-8 (well, this is really
> guessing, at least without a BOM, "does this look like valid UTF-8 to
> you?"), but BOMs and UTF-16-foo, and UTF-32-foo.
Actually, UTF-8 autodetection can be pretty reliable (though I wouldn't
recommend relying on it in applications). The chances that a string in
another encoding is completely free of malformed or overlong UTF-8
sequences are pretty small. UTF-8 has enough syntactic rigor to make it
quite easily distinguishable from any other encoding.

There is a detailed recommendation on BOM handling by encoding
converters in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

which I hope you will find helpful. There are further recommendations
for the authors of Unicode encoding converters in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv

especially on how to use field 5 of the Unicode database and the Unihan
database to construct correct Unicode-to-something-else mapping tables.

These sections summarise the intensive past discussions of these issues
on the linux-utf8 mailing list. I hope that you are already familiar
with them. If not, please read and consider the above sections
carefully.

Thanks!

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
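[The two detection steps discussed above, sniffing for a BOM and then checking whether the bytes form well-formed UTF-8 (rejecting malformed sequences, overlong encodings, and surrogate code points), could be sketched roughly like this. The function names and the exact structure are mine, not from any particular library; treat it as an illustration of why the heuristic is reliable, not as a finished converter:]

```python
# Hypothetical sketch of BOM sniffing + strict UTF-8 validation.

# Longer BOMs must be tested first: UTF-32LE (FF FE 00 00) would
# otherwise be misread as UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def is_valid_utf8(data: bytes) -> bool:
    """True iff data is well-formed UTF-8: no malformed sequences,
    no overlong encodings, no surrogates, nothing above U+10FFFF."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                        # ASCII
            i += 1
        elif 0xC2 <= b <= 0xDF:             # 2-byte (C0/C1 would be overlong)
            if i + 1 >= n or not 0x80 <= data[i + 1] <= 0xBF:
                return False
            i += 2
        elif 0xE0 <= b <= 0xEF:             # 3-byte
            if i + 2 >= n:
                return False
            b1, b2 = data[i + 1], data[i + 2]
            lo = 0xA0 if b == 0xE0 else 0x80   # forbid overlong forms
            hi = 0x9F if b == 0xED else 0xBF   # forbid surrogates D800-DFFF
            if not (lo <= b1 <= hi and 0x80 <= b2 <= 0xBF):
                return False
            i += 3
        elif 0xF0 <= b <= 0xF4:             # 4-byte
            if i + 3 >= n:
                return False
            b1, b2, b3 = data[i + 1], data[i + 2], data[i + 3]
            lo = 0x90 if b == 0xF0 else 0x80   # forbid overlong forms
            hi = 0x8F if b == 0xF4 else 0xBF   # forbid > U+10FFFF
            if not (lo <= b1 <= hi and 0x80 <= b2 <= 0xBF
                    and 0x80 <= b3 <= 0xBF):
                return False
            i += 4
        else:   # stray continuation byte, C0/C1, or F5-FF: never valid
            return False
    return True
```

[The point the heuristic rests on: almost any non-trivial text in a legacy 8-bit encoding contains a high byte that violates one of these constraints, e.g. Latin-1 "héllo" has an 0xE9 byte not followed by two continuation bytes, so it fails the check, while the same word encoded as UTF-8 passes.]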