You might want to consider libxml2 (http://www.xmlsoft.org/), which includes an error-tolerant HTML parser, or its C++ wrapper, libxml++. Since you mention browsers, you might also be able to tease out the parser from the source for Gecko, KHTML, or WebKit.
Note that parsing the "tag soup" HTML that makes up the Web is often a matter of guesswork because so much of it is poorly formed. That's one reason browsers sometimes render the same page differently - they make different guesses as to the intent of the author of a poorly-formed page. Adding that sort of heuristic to Xerces would considerably complicate the code and its maintenance.

-----Original Message-----
From: news on behalf of Piotr Dobrogost
Sent: Mon 5/4/2009 8:42 PM
To: [email protected]
Subject: HTML parsing

Hi

I'd like to use Xerces to parse HTML. As HTML is not XML, I need to tweak Xerces so that it can transform HTML into valid XML. I found information about NekoHTML, which is just what I need, but it's in Java... Do you know if there's something like NekoHTML written in C/C++? If you know a better tool for this job, then please let me know.

Thank you in advance for your time and help.

ps
I was very surprised by how little information I found on the Internet on the topic of parsing HTML with C++. I was even more surprised by how little information on this topic I found on this list. Is there any reason for this? How is this possible when so many browsers are written in C++?

--
Piotr Dobrogost
*** curlpp.org - c++ wrapper for libcurl ***
