I use IRC a lot, and I am trying to write an intelligent charset
recognition algorithm.

My algorithm so far is the following (applied to every message):

* Check the message for 8-bit characters
  if none ->
    * Check the message for {}| inside other text (this should be configurable)
      if any -> text is in iso-646-se (or -dk or -de or ...)
      else   -> text is in iso-646-us (UTF-8 is just as good of course)
* Otherwise, check the message for illegal UTF-8 sequences
  if none -> text is in UTF-8
  else    -> text is in iso-8859-1
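
The steps above can be sketched in Python roughly as follows. This is only
a minimal sketch under some assumptions: the function name guess_charset,
the national_variant parameter, and the reading of "inside other text" as
"directly next to an ASCII letter" are all made up for illustration and
are not part of the scheme as stated.

```python
import re

# Assumption: "{}| inside other text" is taken to mean one of those three
# bytes directly adjacent to an ASCII letter, e.g. "f|rst}".
NATIONAL_LETTER_IN_WORD = re.compile(rb"[A-Za-z][{}|]|[{}|][A-Za-z]")


def guess_charset(raw: bytes, national_variant: str = "iso-646-se") -> str:
    """Guess the charset of one raw message (hypothetical helper)."""
    if all(b < 0x80 for b in raw):
        # No 8-bit characters: some ISO 646 variant.
        if NATIONAL_LETTER_IN_WORD.search(raw):
            return national_variant  # configurable: -se, -dk, -de, ...
        return "iso-646-us"
    try:
        # The strict UTF-8 decoder raises on any illegal sequence.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"
```

Called on the raw bytes of each incoming message, this returns a charset
name that can then be used to decode the message. The adjacency rule is
deliberately simple, so text such as "if (x) { y | z }" is still treated
as iso-646-us.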

What are your thoughts on such a scheme? Of course we should be aiming
for UTF-8, but not all our friends are there yet.

        n.

-- 
[ http://www.dtek.chalmers.se/~d95mback/ ] [ PGP: 0x453504F1 ] [ UIN: 4439498 ]
    Opinions expressed above are mine, and not those of my future employees.
                  Skingra er! Det finns ingenting att förstå!
SIGBORE: Signature boring error, core dumped
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
