I use IRC a lot, and I am trying to devise an intelligent charset
recognition algorithm.
My algorithm so far is the following (applied to every message):
* Check the message for 8-bit characters
  if none ->
    * Check the message for {}| inside other text (this should be configurable)
      if any -> text is in iso-646-se (or -dk or -de or ...)
      else -> text is in iso-646-us (UTF-8 is just as good of course)
  else ->
    * Check the message for illegal UTF-8 sequences
      if none -> text is in UTF-8
      else -> text is in iso-8859-1
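
To make the steps above concrete, here is a rough sketch in Python. It is only one possible reading of the scheme: the function name guess_charset, the national parameter, and the rule that "inside other text" means "directly next to an ASCII letter" are all my assumptions, and the returned strings are plain labels, not codec names.

```python
import re


def guess_charset(raw: bytes, national: bytes = b"{}|") -> str:
    """Guess the charset of one raw message (hypothetical helper).

    `national` is the configurable set of ASCII code points that the
    national ISO 646 variants reuse for letters.
    """
    # Step 1: any byte with the high bit set makes this an 8-bit message.
    if all(byte < 0x80 for byte in raw):
        if national:
            cls = re.escape(national)
            # Assumption: "inside other text" means the character sits
            # directly next to an ASCII letter, as in "f|rst}".
            inside_word = re.compile(
                b"[A-Za-z][" + cls + b"]|[" + cls + b"][A-Za-z]"
            )
            if inside_word.search(raw):
                return "iso-646-se"  # or -dk, -de, ...
        return "iso-646-us"

    # Step 2: 8-bit data. A strict UTF-8 decode fails on any illegal sequence.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"
```

For example, guess_charset(b"a | b") gives "iso-646-us" under this adjacency rule, because the pipe is surrounded by spaces. The fallback to iso-8859-1 cannot fail, since every byte sequence is valid there, so that branch is really "anything that is not UTF-8".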
What are your thoughts on such a scheme? Of course we should be aiming
for UTF-8, but not all our friends are there yet.
n.
--
[ http://www.dtek.chalmers.se/~d95mback/ ] [ PGP: 0x453504F1 ] [ UIN: 4439498 ]
Opinions expressed above are mine, and not those of my future employees.
Skingra er! Det finns ingenting att förstå!
SIGBORE: Signature boring error, core dumped
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/