> IMHO, more research has to be done in this area before a
> "standard" module can be added to Python's stdlib... and
> who knows, perhaps we'll be lucky and by that time everyone
> will be using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked.
Encoding guessing is often combined with language guessing (which uses
a similar approach, but works on characters instead of bytes), and
apparently it can usually be done with high confidence.  Of course,
they're usually looking at clean texts, not random "stuff".  I'll see
if I can get some references and report back -- most of the research
on this was done in the 1990s.
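
To make the idea concrete, here is a toy sketch of one way a byte-based
guesser could work.  The message above does not say which technique the
group uses, so the byte-bigram profiles and cosine scoring below are an
assumption, and the names bigram_profile, cosine and guess_encoding are
hypothetical helpers, not part of any existing module.

```python
import math
from collections import Counter


def bigram_profile(data):
    """Count overlapping pairs of adjacent items (bytes or characters)."""
    return Counter(zip(data, data[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count profiles."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_encoding(data, training_text, candidates):
    """Return the candidate encoding whose profile best matches data.

    Assumption: training_text is in the same language as the unknown
    bytes; a real system would train on far more text than this.
    """
    observed = bigram_profile(data)
    scores = {
        enc: cosine(observed, bigram_profile(training_text.encode(enc)))
        for enc in candidates
    }
    return max(scores, key=scores.get), scores


# Tiny, made-up German training sample; far too small for real use.
TRAINING = "Die Größe der Tür und die Höhe der Bäume überraschen müde Gäste."
CANDIDATES = ["utf-8", "latin-1", "utf-16-le"]

unknown = "Müde Bäume hören die Tür".encode("latin-1")
best, scores = guess_encoding(unknown, TRAINING, CANDIDATES)
print(best)
```

Passing str values instead of bytes to bigram_profile gives character
bigrams, which is roughly the character-based variant mentioned for
language guessing.  With so little training text the scores are not
reliable in general; this only illustrates the shape of the approach.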

Bill
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev