> This reminds me: we should probably have some user-accessible method of
> detecting the Unicode encodings, too: UTF-8 (well, this is really
> guessing, at least without a BOM, "does this look like valid UTF-8 to
> you?"), but BOMs and UTF-16-foo, and UTF-32-foo.
Actually, UTF-8 autodetection can be pretty reliable (though I wouldn't
recommend relying on it in applications). The chances that a string in
another encoding is completely free of malformed or overlong UTF-8
sequences are pretty small. UTF-8 has enough syntactic rigor to make it
quite easily distinguishable from any other encoding.

There is a detailed recommendation on BOM handling by encoding
converters in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

which I hope you will find helpful. There are further recommendations
for the authors of Unicode encoding converters in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv

especially on how to use field 5 of the Unicode database and the Unihan
database to construct correct Unicode-to-something-else mapping tables.

These sections summarise the intensive past discussions of these issues
on the linux-utf8 mailing list. I hope that you are already familiar
with them. If not, please read and consider the above sections
carefully.

Thanks!

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
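[The two detection steps discussed above, sniffing for a BOM and then checking whether the bytes form well-formed UTF-8 (rejecting malformed sequences, overlong encodings, and surrogate code points), could be sketched roughly like this. The function names and the exact structure are mine, not from any particular library; treat it as an illustration of why the heuristic is reliable, not as a finished converter:]

```python
# Hypothetical sketch of BOM sniffing + strict UTF-8 validation.

# Longer BOMs must be tested first: UTF-32LE (FF FE 00 00) would
# otherwise be misread as UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def is_valid_utf8(data: bytes) -> bool:
    """True iff data is well-formed UTF-8: no malformed sequences,
    no overlong encodings, no surrogates, nothing above U+10FFFF."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                        # ASCII
            i += 1
        elif 0xC2 <= b <= 0xDF:             # 2-byte (C0/C1 would be overlong)
            if i + 1 >= n or not 0x80 <= data[i + 1] <= 0xBF:
                return False
            i += 2
        elif 0xE0 <= b <= 0xEF:             # 3-byte
            if i + 2 >= n:
                return False
            b1, b2 = data[i + 1], data[i + 2]
            lo = 0xA0 if b == 0xE0 else 0x80   # forbid overlong forms
            hi = 0x9F if b == 0xED else 0xBF   # forbid surrogates D800-DFFF
            if not (lo <= b1 <= hi and 0x80 <= b2 <= 0xBF):
                return False
            i += 3
        elif 0xF0 <= b <= 0xF4:             # 4-byte
            if i + 3 >= n:
                return False
            b1, b2, b3 = data[i + 1], data[i + 2], data[i + 3]
            lo = 0x90 if b == 0xF0 else 0x80   # forbid overlong forms
            hi = 0x8F if b == 0xF4 else 0xBF   # forbid > U+10FFFF
            if not (lo <= b1 <= hi and 0x80 <= b2 <= 0xBF
                    and 0x80 <= b3 <= 0xBF):
                return False
            i += 4
        else:   # stray continuation byte, C0/C1, or F5-FF: never valid
            return False
    return True
```

[The point the heuristic rests on: almost any non-trivial text in a legacy 8-bit encoding contains a high byte that violates one of these constraints, e.g. Latin-1 "héllo" has an 0xE9 byte not followed by two continuation bytes, so it fails the check, while the same word encoded as UTF-8 passes.]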