On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:

Any program that needs to examine the contents of
documents/feeds/whatever on the web needs to deal with
incorrectly-specified encodings

That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding. In most such programs,
the encoding can be hard-coded if the declared encoding is not
correct. Most such programs *know* what page they are webscraping,
or else they couldn't extract the information out of it that they
want to get at.

I certainly agree that if the target set of documents is small enough, it is possible to hard-code the encoding. There are many applications, however, that need to examine the contents of an arbitrary, or at least large, set of web documents. To name a few such applications:

 - web search engines
 - translation software
 - document/bookmark management systems
 - other kinds of document analysis (market research, SEO, etc.)

As for feeds - can you give examples of incorrectly encoded ones?
(I don't ever use feeds, so I honestly don't know whether they
are typically encoded incorrectly. I've heard they are often XML,
in which case I strongly doubt they are incorrectly encoded)

I also don't have much experience with feeds. My statement is based on the fact that chardet, the tool most often cited in this thread, was written specifically for use with the author's feed-parsing package.
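
For anyone who hasn't used it, a minimal sketch of the kind of guessing chardet does looks roughly like this (the file name is just a placeholder, and the fallback choice of UTF-8 is my own assumption, not anything chardet prescribes):

    import chardet

    raw = open("feed.xml", "rb").read()        # bytes of unknown encoding
    guess = chardet.detect(raw)                # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
    encoding = guess["encoding"] or "utf-8"    # fall back to a default when nothing is detected
    text = raw.decode(encoding, errors="replace")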

As for "whatever" - can you give specific examples?

Not that I can substantiate. Documents and feeds cover a lot of what is on the web--I was only trying to make the point that wherever an encoding can be specified on the web, it will be specified incorrectly for a significant fraction of documents.

(which, sadly, is rather common). The set of programs that need
this functionality is probably the same set that needs
BeautifulSoup--I think that set is larger than just browsers <grin>

Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing,
since they still might be able to hard-code the encoding of the web
page they want to process.

Indeed, if it is only one site, it is pretty easy to work around. My main use of Python is processing and analyzing hundreds of millions of web documents, so it is pretty easy to see applications (I have listed several above). I think that libraries like Mark Pilgrim's FeedParser and BeautifulSoup are possible consumers of guessing as well.
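
To make that concrete, BeautifulSoup already exposes a helper for this job; a rough sketch of how a consumer might lean on it follows (the import and attribute names are those of the bs4 incarnation of the library, and the file path is illustrative):

    from bs4 import UnicodeDammit

    with open("page.html", "rb") as f:
        raw = f.read()

    dammit = UnicodeDammit(raw)
    print(dammit.original_encoding)    # the encoding it settled on
    text = dammit.unicode_markup       # the document decoded to unicode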

In any case, I'm very skeptical that a general "guess encoding"
module would do a meaningful thing when applied to incorrectly
encoded HTML pages.

Well, it does. I wish I could easily provide data on how often it is necessary over the whole web, but that would be somewhat difficult to generate. I can say that it is much more important to be able to parse all the different kinds of encoding _specification_ found on the web (Content-Type and Content-Encoding headers, <meta http-equiv> tags, etc.), including the malformed cases of these.
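
Just to illustrate the kind of declaration-parsing I mean, here is a rough, deliberately tolerant sketch (the function name and regexes are mine, not from any existing library):

    import re

    def declared_encoding(content_type_header, body_bytes):
        # 1. HTTP header:  Content-Type: text/html; charset=ISO-8859-1
        if content_type_header:
            m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type_header, re.I)
            if m:
                return m.group(1)
        # 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
        #    Only peek at the start of the document, decoded loosely.
        head = body_bytes[:2048].decode("ascii", errors="replace")
        m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", head, re.I)
        if m:
            return m.group(1)
        return None  # nothing declared; the caller must guess or assume a default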

I can also think of good arguments for excluding encoding detection for maintenance reasons: is every case of the algorithm guessing wrong a bug that needs to be fixed in the stdlib? That is an unbounded commitment.

-Mike