Re: intelligent charset recognition for irc

Markus Kuhn Wed, 04 Oct 2000 03:49:41 -0700
Martin Norbäck wrote on 2000-10-04 10:00 UTC:
> * Check the message for 8-bit characters
>   if none ->
>     * Check the message for {}| inside other text (this should be configurable)
>       if any -> text is in iso-646-se (or -dk or -de or ...)
>       else   -> text is in iso-646-us (UTF-8 is just as good of course)

This is a very crude hack and will fail at a significant rate. I'd
suggest that only UTF-8 and ISO 8859-1 should be autodetected. Are
national ISO 646 variants still used so widely here that such a hack
with guaranteed bad side effects has to be recommended? I personally
doubt it. Practically nobody uses hardware that doesn't support ISO
8859-1 these days, and an ISO 646-SE autodetector is far more likely to
become part of the problem than part of the solution. ISO 646 died
sometime in the late 1980s as far as I can tell.
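
To illustrate the kind of side effect meant above, here is a minimal
sketch of the quoted {}| heuristic. The function names are hypothetical,
the "inside other text" test is assumed to mean "directly next to an
ASCII letter", and the table covers only the six bracket positions that
the Swedish variant (SEN 850200 Annex B) reassigns:

```python
import re

# Assumed mapping for the six ISO 646-SE positions that differ from
# US-ASCII brackets: [ \ ] { | }  ->  Ä Ö Å ä ö å
ISO646_SE = str.maketrans("[\\]{|}", "ÄÖÅäöå")

# Hypothetical reading of "{}| inside other text": one of the three
# characters directly adjacent to an ASCII letter.
_INSIDE_TEXT = re.compile(r"(?<=[A-Za-z])[{}|]|[{}|](?=[A-Za-z])")


def looks_like_iso646_se(text: str) -> bool:
    """Return True if the naive heuristic would classify text as ISO 646-SE."""
    return _INSIDE_TEXT.search(text) is not None


def decode_iso646_se(text: str) -> str:
    """Reinterpret the bracket characters as Swedish letters."""
    return text.translate(ISO646_SE)


# Genuine ISO 646-SE text is recovered correctly ...
print(decode_iso646_se("r{ksm|rg}s"))        # räksmörgås

# ... but an ordinary ASCII shell pipeline also triggers the heuristic
# and gets silently corrupted.
print(looks_like_iso646_se("ls|grep foo"))   # True
print(decode_iso646_se("ls|grep foo"))       # lsögrep foo
```

Any channel where people paste shell commands or C code would hit the
second case regularly.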

> * Check the message for illegal UTF-8 sequences
>   if none -> text is in UTF-8
>   else    -> text is in iso-8859-1

That should be fairly practical and reliable to do. Would it be worth
assuming that the text is in CP1252 instead of ISO 8859-1 if the UTF-8
test fails?
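
A minimal sketch of that test, assuming the raw message is available as
bytes. The function name is hypothetical, and the ISO 8859-1 fallback for
the few byte values that Python's cp1252 codec leaves undefined is my own
assumption, not part of the proposal:

```python
from typing import Tuple


def decode_irc_message(raw: bytes) -> Tuple[str, str]:
    """Return (text, charset_name) for one raw message.

    Strict UTF-8 decoding rejects illegal sequences, so a successful
    decode is taken as evidence that the text is UTF-8. Pure ASCII
    also passes this test.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        # CP1252 agrees with ISO 8859-1 on 0xA0-0xFF and adds printable
        # characters in most of the 0x80-0x9F range.
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Assumption: fall back to ISO 8859-1, which maps every byte.
        return raw.decode("latin-1"), "iso-8859-1"


print(decode_irc_message(b"sm\xc3\xb6rg\xc3\xa5s"))  # valid UTF-8
print(decode_irc_message(b"sm\xf6rg\xe5s"))          # illegal UTF-8, 8-bit text
```

Both calls return the same text, "smörgås"; only the reported charset
differs.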

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/