Martin Norbäck wrote on 2000-10-04 10:00 UTC:
> * Check the message for 8-bit characters
> if none ->
> * Check the message for {}| inside other text (this should be configurable)
> if any -> text is in iso-646-se (or -dk or -de or ...)
> else -> text is in iso-646-us (UTF-8 is just as good of course)
This is a very crude hack and will fail at a significant rate. I'd
suggest that only UTF-8 and ISO 8859-1 be autodetected. Are the
national ISO 646 variants still so widely used here that a hack with
guaranteed bad side effects has to be recommended? I personally doubt
it. Practically nobody uses hardware that doesn't support ISO 8859-1
these days, and an ISO 646-SE autodetector is far more likely to become
part of the problem than part of the solution. ISO 646 died sometime
in the late 1980s, as far as I can tell.
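
To illustrate the side effects, here is a minimal sketch of such a
detector. The function name and the reading of "{}| inside other text"
as "one of these characters directly next to an ASCII letter" are my
assumptions, not part of the quoted proposal.

```python
import re

# Assumed interpretation: '{', '}' or '|' directly adjacent to an ASCII
# letter. In ISO 646-SE these three code positions hold the letters
# a-diaeresis, a-ring and o-diaeresis.
_SE_LETTER_IN_WORD = re.compile(r"[A-Za-z][{}|]|[{}|][A-Za-z]")


def looks_like_iso646_se(text: str) -> bool:
    """Hypothetical heuristic: True if 7-bit text 'looks like' ISO 646-SE."""
    return _SE_LETTER_IN_WORD.search(text) is not None


# Intended hit: a Swedish name typed on an ISO 646-SE terminal.
print(looks_like_iso646_se("Norb{ck"))
# False positives: ordinary US-ASCII text with braces or a pipe.
print(looks_like_iso646_se("struct point {int x;}"))
print(looks_like_iso646_se("ls|grep foo"))
```

Any message that quotes C code or a shell pipeline trips this check,
which is the kind of failure I mean.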
> * Check the message for illegal UTF-8 sequences
> if none -> text is in UTF-8
> else -> text is in iso-8859-1
That should be fairly practical and reliable to do. Would it be worth
assuming that the text is in CP1252 instead of ISO 8859-1 if the UTF-8
test fails?
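
The UTF-8 test with a configurable fallback could be sketched roughly
as follows. This is only an illustration: the function name, the
returned labels, and the `fallback` parameter are assumptions of mine,
and it relies on Python's strict UTF-8 decoder to reject malformed
sequences.

```python
def guess_charset(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical sketch: guess the charset of raw message bytes."""
    # No 8-bit characters at all: plain 7-bit text.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    try:
        # Strict decoding raises on any illegal UTF-8 sequence.
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the configured 8-bit charset.
        return fallback


print(guess_charset(b"hello"))
print(guess_charset("Norb\u00e4ck".encode("utf-8")))
print(guess_charset("Norb\u00e4ck".encode("iso-8859-1")))
print(guess_charset(b"\x93quoted\x94", fallback="cp1252"))
```

The last call shows why CP1252 may be the more useful fallback: bytes
0x93 and 0x94 are C1 control codes in ISO 8859-1 but curly quotation
marks in CP1252.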
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/