Jesse Pelton wrote:
> You might want to consider libxml2 (http://www.xmlsoft.org/) or its C++
> wrapper, libxml++. Since you mention browsers, you might also be able to
> tease out the parser from the source for Gecko, KHTML, or WebKit.

Thanks, Jesse, for these suggestions.

> Note that parsing the "tag soup" HTML that makes up the Web is often a matter
> of guesswork (...)

Agreed. Adding that sort of heuristic to Xerces would considerably complicate the code and its maintenance. From what I can see, NekoHTML uses some Xerces APIs; it doesn't _modify_ Xerces but _uses_ it. I can't believe there's nothing like NekoHTML written in C++ for Xerces-C++. The amount of HTML is huge compared with XML, and people are left with no good tool to work with HTML. Having a C++ version of NekoHTML would make Xerces-C++ even more popular and valuable.

--
Piotr Dobrogost
*** curlpp.org - c++ wrapper for libcurl ***
