Stefan Behnel wrote:
> I would have a hard time feeling happy
> if a real-world HTML parser was added to the stdlib that provides a totally
> different interface than the best (and fastest) XML library that the stdlib
> currently has.

I doubt there would be any objection to someone contributing wrappers for upgrades, but I wouldn't count on them being used.

lxml may well be the best choice for XML. BeautifulSoup and html5lib wouldn't even exist if that actually mattered for most of *their* use cases.

Think of them more as pre-processors, like tidylib. If enough web pages were even valid HTML (let alone valid and well-formed XML), no one would have bothered to write these libraries.

BeautifulSoup has the advantage of being long-proven in practice on ugly HTML. (You mention an lxml feature with a similar intent, but for lxml it is one of several add-on features; for BeautifulSoup, it is the whole point.)

html5lib does not have as long a history, but it does have the advantage of being almost an endorsed standard. Much of HTML 5 is documenting the workarounds that browser makers already employ to handle erroneous input, so that the complexities can at least stop compounding. html5lib is intended as a reference implementation, and the W3C editor has used it to motivate changes in the specification draft. (This may make it unsuitable for inclusion in the stdlib today, because of timing issues.)

In other words, it isn't just the heuristics of one particular development team; it is (modulo bugs, and after official publication) the heuristics that the major web browser makers have agreed to treat as "correct" in the future.

-jJ
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev