On Nov 3, 2009, at 12:06 AM, Guido van Rossum wrote:
On Mon, Nov 2, 2009 at 9:51 PM, sstein...@gmail.com <sstein...@gmail.com> wrote:
BeautifulSoup, which I use every day, is one such product. Since the crappy
old SGML parser's gone, BeautifulSoup uses the one that's left in Python 3
and it makes BeautifulSoup completely useless for my daily work.
This sounds like an area where some help might be useful. Perhaps the
quickest solution would simply be to copy the old crappy "sgml" based
html parser into a new version of BeautifulSoup.
That is what we're discussing doing on the old-soup branch at
http://github.com/adevore/old-beautiful-soup. I'm not exactly sure why
the old SGML parser was dropped, but it seems that porting it to
Python 3 would have been enough of an effort that the Python library
dropped it, and the current developer of the mainline of Beautiful Soup
decided to just use what was available natively in Python 3.
Though I imagine what it really needs is a "quirks mode" parser that
is compatible with the HTML dialect accepted by, say, IE6. Maybe a
summer of code project?
I think it just relies on the old SGML parser's not blowing up on
completely bogus HTML (like most of the web) and does the best it can
with the 'chunks' that come back; nothing to do with quirks mode per se.
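
To make the "chunks" idea concrete, here is a minimal sketch of an event-based parser that keeps going over badly nested markup and just records what it sees. It uses html.parser from current Python 3 as a stand-in, not the old sgmllib parser and not Beautiful Soup's actual code; the names ChunkCollector and collect_chunks are made up for illustration.

```python
from html.parser import HTMLParser


class ChunkCollector(HTMLParser):
    """Hypothetical helper: records parser events as (kind, value) chunks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whether or not it is ever closed.
        self.chunks.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for every closing tag, even if it is out of order.
        self.chunks.append(("end", tag))

    def handle_data(self, data):
        # Keep only non-blank text between tags.
        text = data.strip()
        if text:
            self.chunks.append(("data", text))


def collect_chunks(markup):
    """Feed markup to the parser and return the recorded chunks."""
    parser = ChunkCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# Unclosed <p> elements and mis-nested <b>/<i>: typical "bogus" HTML.
bogus = "<p>Unclosed paragraph<b>bold <i>nested wrong</b></i><p>next"
chunks = collect_chunks(bogus)
for kind, value in chunks:
    print(kind, value)
```

The parser reports tags in the order they appear and does not try to repair the nesting; deciding what tree to build from those chunks is left to the caller, which is the part a scraper has to do on top.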
As for a Summer of Code project, I have no idea what would be
involved. I know there are lots of users of Beautiful Soup; as far
as I know it is the best scraper of HTML, valid or not, that's
out there. It's been around a long time, and I see it in projects in
the "html scraping" realm all the time.
At any rate, it's just one example of where the developer has taken
the easy route out with a 3.0 port and has produced a product that's
"Python 3" but, instead of getting better with Python's new features,
has actually become useless for the majority of use cases where
it formerly shone.
S
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev