Re: [Python-Dev] Encoding detection in the standard library?

M.-A. Lemburg Tue, 22 Apr 2008 13:58:51 -0700

[CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:

>> I don't think that should be part of the standard library. People
>> will mistake what it tells them for certain.
> [etc]
> These are all good arguments, but the fact remains that we can't control
> our inputs (e.g., we're archiving mail messages sent to lists managed by
> DrProject), and some of those inputs *don't* tell us how they're encoded.
> Under those circumstances, what would you recommend?


I haven't done much research into this, but in general, I think it's
better to:

 * first try to look at other characteristics of a text
   message, e.g. language, origin, topic, etc.,

 * then narrow down the number of encodings which could apply,

 * rank them to try to avoid ambiguities and

 * then try to see what percentage of the text you can decode using
   each of the encodings in reverse ranking order (i.e. more specialized
   encodings should be tested first, latin-1 last).
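
The last step above can be sketched in Python roughly as follows. This is
only a minimal illustration, not a tested detector: the function names, the
default candidate list and the threshold are all assumptions made for the
example, and the candidate list stands in for whatever narrowing and
ranking the earlier steps produce. Undecodable input is counted through the
U+FFFD replacement characters that errors="replace" inserts, so the score
is a fraction of decoded characters, not of input bytes.

```python
def decodable_fraction(data, encoding):
    """Return the share of decoded characters that are not U+FFFD.

    Hypothetical helper: errors="replace" substitutes U+FFFD for input
    the codec cannot decode, so counting those gives a rough measure
    of how much of the text the encoding can handle.
    """
    if not data:
        return 1.0
    text = data.decode(encoding, errors="replace")
    return 1.0 - text.count("\ufffd") / len(text)


def guess_encoding(data,
                   candidates=("ascii", "utf-8", "cp1252", "latin-1"),
                   threshold=1.0):
    """Try the candidates in order, most specialized first, latin-1 last.

    Returns (encoding, score) for the first candidate whose score reaches
    the threshold, or the best-scoring candidate if none does. The
    default candidates and threshold are assumptions for this sketch.
    """
    if not candidates:
        raise ValueError("at least one candidate encoding is required")
    best = None
    for encoding in candidates:
        score = decodable_fraction(data, encoding)
        if score >= threshold:
            return encoding, score
        if best is None or score > best[1]:
            best = (encoding, score)
    return best


if __name__ == "__main__":
    for sample in (b"hello", "caf\u00e9".encode("utf-8"), b"caf\xe9"):
        print(sample, guess_encoding(sample))
```

Note that latin-1 maps every byte to a character, so it always scores 1.0;
that is why it only makes sense as the last fallback, and why a result of
"latin-1" says very little about the real encoding.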

--
Marc-Andre Lemburg
eGenix.com
