On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:

Note, however, that html5lib is likely way too big to add to the
stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
in Python 3, which would be the target release series for better HTML
support. So, whatever library or API you would want to use for HTML
processing is currently only the second question as long as Py3 lacks
a real-world HTML parser in the stdlib, as well as a robust character
detection mechanism. I don't think that can be fixed all that easily.

Here's the problem in a nutshell, I think:

 1. Everybody wants an HTML parser in the stdlib, because it's
    inconvenient to pull in a dependency for such a "simple" task.
 2. Everybody wants the stdlib to remain small, stable, and simple and
    not get "overcomplicated".
 3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
    there exist rapidly-evolving standards and libraries to deal with
    it. Parsing 'the web' (which is rapidly growing to include stuff
    like SVG, MathML, etc.) is even harder.


My personal opinion is that html5lib gets this problem almost completely
right, and so it should be absorbed by the stdlib.

A little data: the html5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years old. Since there is a separate Python 3 repository, and I saw no mention of Python 3 compatibility anywhere else, including the PyPI listing, I assume that release is for Python 2 only.

A comment on a recent (July 11) Python 3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python 3 version still has problems: "Merged in now, though still lots of errors and failures in the testsuite."

--
Terry Jan Reedy

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev