On Nov 3, 2009, at 12:06 AM, Guido van Rossum wrote:

On Mon, Nov 2, 2009 at 9:51 PM, sstein...@gmail.com <sstein...@gmail.com > wrote:
BeautifulSoup, which I use every day, is one such product. Since the crappy old SMGL parser's gone, BeautifulSoup uses the one that's left in Python 3
and it makes BeautifulSoup completely useless for my daily work.

This sounds an area where some help might be useful. Perhaps the
quickest solution would simply be to copy the old crappy "sgml" based
html parser into a new version of BeautifulSoup.

That is what we're discussing doing on the old-soup branch at http://github.com/adevore/old-beautiful-soup . I'm not exactly sure why the old SGML parser was dropped but it seems that porting it to Python 3 would be enough of an effort that it caused the Python library to drop it, and the current developer of the mainline of Beautiful Soup to decide to just use what was available in Python 3 natively.

Though I imagine what it really needs is a "quirks mode" parser that is compatible with the
HTML dialect accepted by, say, IE6. Maybe a summer of code project?

I think it just relies on the old SGML parser's not blowing up on completely bogus HTML (like most of the web) and does the best it can with the 'chunks' that come back; nothing to do with quirks mode per se.

As for a Summer of Code project, I have no idea what would be involved. I know there are lots of users for Beautiful soup; as far as I know it is the best scraper of HTML code, valid or not, that's out there and it's been around a long time and I see it in projects in the "html scraping" realm all the time.

At any rate, it's just one example of where the developer has taken the easy route out with a 3.0 port and has produced a product that's "Python 3" but, instead of getting better with Python's new features, has actually become useless for the majority of use-cases where formerly it shined.

S

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to