[CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:
>> I don't think that should be part of the standard library. People
>> will mistake what it tells them for certain.
> [etc]
>
> These are all good arguments, but the fact remains that we can't
> control our inputs (e.g., we're archiving mail messages sent to lists
> managed by DrProject), and some of those inputs *don't* tell us how
> they're encoded. Under those circumstances, what would you recommend?

I haven't done much research into this, but in general, I think it's
better to:

 * first try to look at other characteristics of a text
   message, e.g. language, origin, topic, etc.,

 * then narrow down the number of encodings which could apply,

 * rank them to try to avoid ambiguities, and

 * then try to see what percentage of the text you can decode using
   each of the encodings in reverse ranking order (i.e. more specialized
   encodings should be tested first, latin-1 last).
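
The last two steps could be sketched roughly as follows, in present-day
Python 3. This is only an illustration under stated assumptions, not a
tested detector: the function names decodable_fraction and rank_encodings
are made up for this example, the candidate list is assumed to have been
narrowed and ordered by hand (most specialized first, latin-1 last), and
the "percentage you can decode" is approximated by counting the U+FFFD
characters that errors="replace" substitutes for undecodable input.

```python
from typing import Iterable, List, Tuple

# Character that errors="replace" substitutes for undecodable input.
REPLACEMENT = "\ufffd"


def decodable_fraction(data: bytes, encoding: str) -> float:
    """Approximate share of the decoded text that did not need replacing.

    Hypothetical helper: a genuine U+FFFD in the input is counted as a
    failure too, so treat the result as a rough score only.
    """
    text = data.decode(encoding, errors="replace")
    if not text:
        return 1.0  # nothing to decode, so nothing failed
    return 1.0 - text.count(REPLACEMENT) / len(text)


def rank_encodings(data: bytes,
                   candidates: Iterable[str]) -> List[Tuple[str, float]]:
    """Score each candidate and return (encoding, score), best first.

    `candidates` is assumed to be ordered most specialized first.
    sorted() is stable, so encodings with equal scores keep that order.
    """
    scored = [(enc, decodable_fraction(data, enc)) for enc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Assumed candidate list for a Western European mail archive.
    candidates = ["ascii", "utf-8", "cp1252", "latin-1"]
    raw = b"Gr\xfc\xdfe"  # "Gruesse" with u-umlaut and sharp s, in latin-1
    for encoding, score in rank_encodings(raw, candidates):
        print(encoding, round(score, 2))
```

Note that latin-1 maps every byte to a character, so it always scores
1.0; that is why it only makes sense as the last fallback, and why a
perfect score says nothing about whether the guess is actually right.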

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 22 2008)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611