> IMHO, more research has to be done into this area before a
> "standard" module can be added to the Python's stdlib... and
> who knows, perhaps we're lucky and by the time everyone is
> using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked. Encoding guessing is often combined with language guessing (which uses a similar approach, but works on characters instead of bytes), and apparently it can usually be done with high confidence. Of course, they're usually looking at clean texts, not random "stuff".

I'll see if I can get some references and report back -- most of the research on this was done in the 1990s.

Bill
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev