On Fri, 08 Jan 2010 11:44:48 +0800, Water Lin wrote:

> I am a new guy to use Python, but I want to parse a html page now. I
> tried to use HTMLParse. Here is my sample code:
> ----------------------
> from HTMLParser import HTMLParser

Note that HTMLParser only tokenises HTML; it doesn't actually *parse* it. You just get a stream of tag, text, entity, text, tag, ..., not a parse tree.

In particular, if an element has its start and/or end tags omitted, you won't get any notification about the start and/or end of the element; you have to infer that yourself from the fact that you're getting a tag which wouldn't be allowed outside or inside the element. E.g. if the document omits </p> tags and you get a <p> tag when you are (or *thought* that you were) already within a paragraph, you can infer the omitted </p> tag.

If you want an actual parser, look at BeautifulSoup. It also does a good job of handling invalid HTML (which seems to be far more common than valid HTML).
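
To make the omitted-</p> inference concrete, here is a rough sketch, not taken from the original poster's code. It assumes Python 3, where the module is spelled html.parser (on Python 2 the import would be "from HTMLParser import HTMLParser" as in the quoted message). The names ParagraphTracker and tokenise are made up for illustration, and only the "<p> while already in a paragraph" case is handled.

```python
from html.parser import HTMLParser  # Python 3 spelling of the HTMLParser module


class ParagraphTracker(HTMLParser):
    """Hypothetical helper: logs the token stream and infers omitted </p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # our own bookkeeping; the tokeniser keeps none
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.in_paragraph:
                # A <p> is not allowed inside a <p>, so the previous
                # paragraph must have ended without an explicit </p>.
                self.events.append(("inferred-end", "p"))
            self.in_paragraph = True
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))


def tokenise(html):
    """Feed a document through the tracker and return the recorded events."""
    tracker = ParagraphTracker()
    tracker.feed(html)
    tracker.close()  # flush any buffered trailing text
    return tracker.events


for event in tokenise("<p>first<p>second</p>"):
    print(event)
```

A real document would need more rules than this: a paragraph can also be closed implicitly by other block-level start tags or by the end of its parent element. That is exactly the bookkeeping a real parser does for you.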