Martin Norbäck wrote:

> I am using IRC a lot, and I am trying to make an intelligent charset
> recognition algorithm.
>
> * Check the message for 8-bit characters
>   if none ->
>     * Check the message for {}| inside other text (this should be configurable)
>       if any -> text is in iso-646-se (or -dk or -de or ...)
>       else   -> text is in iso-646-us (UTF-8 is just as good of course)
> * Check the message for illegal UTF-8 sequences
>   if none -> text is in UTF-8
>   else    -> text is in iso-8859-1
>
> What are your thoughts on such a scheme? Of course we should be aiming
> for UTF-8, but not all our friends are there yet.
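
For reference, the quoted scheme could be sketched roughly as below. This is
only my reading of it: the function name guess_charset is made up, and I am
assuming that "inside other text" means a national character directly next to
an ASCII letter.

```python
import re


def guess_charset(data: bytes, national: bytes = b"{}|") -> str:
    """Guess a charset label for one message, following the quoted steps.

    `national` is the configurable set of ISO 646 national-variant bytes.
    """
    if all(b < 0x80 for b in data):
        # 7-bit only: look for a national character touching a letter
        # (assumed meaning of "inside other text").
        if national:
            cls = b"[" + re.escape(national) + b"]"
            pattern = b"[A-Za-z]" + cls + b"|" + cls + b"[A-Za-z]"
            if re.search(pattern, data):
                return "iso-646-se"  # or -dk, -de, ...
        return "iso-646-us"
    try:
        data.decode("utf-8")  # raises on any illegal UTF-8 sequence
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"
```

For example, guess_charset(b"f|r") gives "iso-646-se", while a lone "|" between
spaces leaves the message classified as iso-646-us.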

A more general test might be a set of functions, one per encoding, that each
return a value saying how likely the text is to be in that encoding; at the end
you choose the encoding with the highest score.
Doing this will make it much easier to add additional charsets later.
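
A minimal sketch of that idea, assuming each scoring function returns a value
between 0.0 and 1.0. All names and the score values themselves are
illustrative guesses, not tuned numbers.

```python
from typing import Callable, Dict

Scorer = Callable[[bytes], float]


def score_ascii(data: bytes) -> float:
    # Certain when every byte is 7-bit, impossible otherwise.
    return 1.0 if all(b < 0x80 for b in data) else 0.0


def score_utf8(data: bytes) -> float:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return 0.0
    # Pure ASCII is valid UTF-8 but gives no real evidence for it.
    return 0.9 if any(b >= 0x80 for b in data) else 0.5


def score_latin1(data: bytes) -> float:
    # Every byte string decodes as ISO-8859-1, so this is a weak fallback.
    return 0.1


SCORERS: Dict[str, Scorer] = {
    "us-ascii": score_ascii,
    "utf-8": score_utf8,
    "iso-8859-1": score_latin1,
}


def detect(data: bytes, scorers: Dict[str, Scorer] = SCORERS) -> str:
    """Return the name of the encoding whose scorer rates the data highest."""
    return max(scorers, key=lambda name: scorers[name](data))
```

Supporting another charset then only means adding one more entry to the
table, without touching the existing tests.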

And there will be demand for that sooner or later.
When I connect to an IRC server, it is not difficult to find channels that use
a Japanese encoding (iso-2022-jp). I imagine there are Russian, Chinese, etc.
users too.

Will your code run on the server or on the client?
You will run into the problem that your code needs to be transparent to
encodings it does not recognize.

How long will the text be on which you have to decide the charset?
Will you need to auto-detect for each message transferred?

I can give an additional hint: if the message has several characters over 0x80
in a row, or too many characters over 0x80 overall, it is very probably not
ISO-8859-1.
This is not necessarily true for other ISO-8859-x charsets, for example
ISO-8859-7 (Greek), where ordinary words consist almost entirely of such
characters.
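
As a rough illustration of that hint, the helper below measures the longest
run and the overall share of bytes at or above 0x80. The function names and
both thresholds are assumptions picked for the example; real values would
need testing on actual messages, and very short messages will be unreliable.

```python
from typing import Tuple


def high_byte_stats(data: bytes) -> Tuple[int, float]:
    """Return (longest run of bytes >= 0x80, fraction of bytes >= 0x80)."""
    longest = run = high = 0
    for b in data:
        if b >= 0x80:
            run += 1
            high += 1
            longest = max(longest, run)
        else:
            run = 0
    ratio = high / len(data) if data else 0.0
    return longest, ratio


def unlikely_latin1(data: bytes, max_run: int = 2,
                    max_ratio: float = 0.3) -> bool:
    """Heuristic only: True when the high-byte pattern argues against
    ISO-8859-1. The default thresholds are illustrative guesses."""
    longest, ratio = high_byte_stats(data)
    return longest > max_run or ratio > max_ratio
```

A Western European word such as "smörgåsbord" in ISO-8859-1 has only isolated
high bytes and passes, while a Greek word in ISO-8859-7 is one long run of
them and gets flagged.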

ESC-x sequences at the start and at the end of a line are typical of
ISO-2022-derived encodings, but I'm afraid many IRC clients will send the local
encoding in 8 bits rather than an ISO-2022 version that is easy to auto-detect.
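
For the ISO-2022-JP case, a check along these lines might be enough. It is a
sketch that only looks for the four escape sequences ISO-2022-JP uses to
switch character sets; the function name is made up, and other ISO-2022
variants would need their own lists.

```python
# Escape sequences that switch character sets in ISO-2022-JP:
#   ESC $ @  JIS X 0208-1978      ESC $ B  JIS X 0208-1983
#   ESC ( B  ASCII                ESC ( J  JIS X 0201 Roman
ISO2022JP_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")


def looks_like_iso2022jp(data: bytes) -> bool:
    """True if the data is 7-bit and contains an ISO-2022-JP escape."""
    if any(b >= 0x80 for b in data):
        return False  # ISO-2022-JP never uses 8-bit bytes
    return any(esc in data for esc in ISO2022JP_ESCAPES)
```

Text sent in an 8-bit local encoding will of course not match, which is
exactly the case I am worried about.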

Mozilla/Netscape 6 has some code to auto-detect the charset; unfortunately it
is not completely generic.

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
