On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:

Any program that needs to examine the contents of
documents/feeds/whatever on the web needs to deal with
incorrectly-specified encodings

That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding. In most such programs,
the encoding can be hard-coded if the declared encoding is not
correct. Most such programs *know* what page they are webscraping,
or else they couldn't extract the information out of it that they
want to get at.

I certainly agree that if the target set of documents is small enough, it is possible to hard-code the encoding. There are many applications, however, that need to examine the contents of an arbitrary, or at least large, set of web documents. To name a few such applications:

 - web search engines
 - translation software
 - document/bookmark management systems
 - other kinds of document analysis (market research, SEO, etc.)

As for feeds - can you give examples of incorrectly encoded ones?
(I don't ever use feeds, so I honestly don't know whether they
are typically encoded incorrectly. I've heard they are often XML,
in which case I strongly doubt they are incorrectly encoded)

I also don't have much experience with feeds. My statement is based on the fact that chardet, the tool most often cited in this thread, was written specifically for use with the author's feed-parsing package.
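
For anyone who hasn't used it, a minimal sketch of the kind of guessing chardet does looks roughly like this (the file name is just a placeholder, and the fallback choice of UTF-8 is my own assumption, not anything chardet prescribes):

    import chardet

    raw = open("feed.xml", "rb").read()        # bytes of unknown encoding
    guess = chardet.detect(raw)                # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
    encoding = guess["encoding"] or "utf-8"    # fall back to a default when nothing is detected
    text = raw.decode(encoding, errors="replace")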

As for "whatever" - can you give specific examples?

Not that I can substantiate. Documents and feeds cover a lot of what is on the web--I was only trying to make the point that wherever an encoding can be specified on the web, it will be specified incorrectly for a significant fraction of documents.

(which, sadly, is rather common). The set of programs that need
this functionality is probably the same set that needs
BeautifulSoup--I think that set is larger than just browsers <grin>

Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing,
since they still might be able to hard-code the encoding of the web
page they want to process.

Indeed, if it is only one site, it is pretty easy to work around. My main use of Python is processing and analyzing hundreds of millions of web documents, so it is pretty easy to see applications (I have listed several above). I think that libraries like Mark Pilgrim's FeedParser and BeautifulSoup are possible consumers of guessing as well.
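
To make that concrete, BeautifulSoup already exposes a helper for this job; a rough sketch of how a consumer might lean on it follows (the import and attribute names are those of the bs4 incarnation of the library, and the file path is illustrative):

    from bs4 import UnicodeDammit

    with open("page.html", "rb") as f:
        raw = f.read()

    dammit = UnicodeDammit(raw)
    print(dammit.original_encoding)    # the encoding it settled on
    text = dammit.unicode_markup       # the document decoded to unicode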

In any case, I'm very skeptical that a general "guess encoding"
module would do a meaningful thing when applied to incorrectly
encoded HTML pages.

Well, it does. I wish I could easily provide data on how often it is necessary over the whole web, but that would be somewhat difficult to generate. I can say that it is much more important to be able to parse all the different kinds of encoding _specification_ found on the web (Content-Type and Content-Encoding headers, <meta http-equiv> tags, etc.), including the malformed cases of these.
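
Just to illustrate the kind of declaration-parsing I mean, here is a rough, deliberately tolerant sketch (the function name and regexes are mine, not from any existing library):

    import re

    def declared_encoding(content_type_header, body_bytes):
        # 1. HTTP header:  Content-Type: text/html; charset=ISO-8859-1
        if content_type_header:
            m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type_header, re.I)
            if m:
                return m.group(1)
        # 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
        #    Only peek at the start of the document, decoded loosely.
        head = body_bytes[:2048].decode("ascii", errors="replace")
        m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", head, re.I)
        if m:
            return m.group(1)
        return None  # nothing declared; the caller must guess or assume a default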

I can also think of good arguments for excluding encoding detection for maintenance reasons: is every case of the algorithm guessing wrong a bug that needs to be fixed in the stdlib? That is an unbounded commitment.

-Mike