The stdlib HTML parser expects well-formed HTML and does not try to repair broken markup.

To parse the broken HTML you find in the real world, you need a third-party
library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times
as many lines of code) but can handle nearly anything a browser can.
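
To make "broken" concrete, here is a rough stdlib-only sketch. It assumes Python 3.5 or later, where `html.parser` reports tags as it encounters them and leaves any repair to the caller. `TagBalanceChecker`, `check_balance` and `VOID_TAGS` are hypothetical names made up for this illustration, and the check is deliberately naive. It does nothing like the recovery BeautifulSoup or a browser performs.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not "unclosed".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records tags that do not nest cleanly."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # start tags still waiting for an end tag
        self.mismatched_end_tags = []  # end tags that did not close the innermost open tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # The parser just reports the tag; it does not fix the nesting.
            self.mismatched_end_tags.append(tag)


def check_balance(html):
    """Return (tags left open, end tags that did not match)."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.open_tags, checker.mismatched_end_tags


if __name__ == "__main__":
    print(check_balance("<p>Hello <b>world</b></p>"))  # well-formed
    print(check_balance("<p>Hello <b>world</p>"))      # <b> is never closed
```

Deciding what to do about each mismatch (close the `<b>` implicitly, drop the stray end tag, and so on) is the part a library like BeautifulSoup handles for you, and it accounts for much of the extra code.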

I doubt the stdlib will ever compete with BeautifulSoup.

