RE: HTML parsing

Jesse Pelton Tue, 05 May 2009 04:12:41 -0700

You might want to consider libxml2 (http://www.xmlsoft.org/) or its C++ 
wrapper, libxml++.  Since you mention browsers, you might also be able to tease 
out the parser from the source for Gecko, KHTML, or WebKit.


Note that parsing the "tag soup" HTML that makes up the Web is often a matter 
of guesswork because so much of it is poorly formed.  That's one reason 
browsers sometimes render the same page differently - they make different 
guesses as to the intent of the author of a poorly-formed page.  Adding that 
sort of heuristic to Xerces would considerably complicate the code and its 
maintenance.
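
To make the "guesswork" concrete, here is a minimal sketch of the kind of repair heuristics involved. It is not taken from Xerces, libxml2, or any browser; `repair_tag_soup` is a hypothetical helper, and the three rules it applies (drop unmatched close tags, implicitly close inner elements, let a new `<p>` or `<li>` end the previous one) are assumptions chosen for illustration. It handles only bare tags without attributes, and it does not cope with comments, entities, mixed case, or a literal `<` in text that is followed later by a `>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: turns a tiny subset of tag soup into well-formed markup.
std::string repair_tag_soup(const std::string& html) {
    static const std::set<std::string> void_tags = {"br", "hr", "img"};
    std::vector<std::string> open;  // stack of currently open elements
    std::string out;
    std::size_t i = 0;

    while (i < html.size()) {
        if (html[i] != '<') {  // plain text is copied through unchanged
            out += html[i++];
            continue;
        }
        std::size_t end = html.find('>', i);
        if (end == std::string::npos) {  // '<' with no '>' after it: escape it
            out += "&lt;";
            ++i;
            continue;
        }
        std::string tag = html.substr(i + 1, end - i - 1);
        i = end + 1;

        if (tag.empty()) {  // "<>" is not a tag: emit it as escaped text
            out += "&lt;&gt;";
        } else if (tag[0] == '/') {  // closing tag
            std::string name = tag.substr(1);
            if (std::find(open.rbegin(), open.rend(), name) == open.rend()) {
                continue;  // guess 1: silently drop a close tag that matches nothing
            }
            // guess 2: implicitly close everything opened after the match
            while (open.back() != name) {
                out += "</" + open.back() + ">";
                open.pop_back();
                assert(!open.empty());  // holds because the match is on the stack
            }
            out += "</" + name + ">";
            open.pop_back();
        } else if (void_tags.count(tag) != 0) {
            out += "<" + tag + "/>";  // void elements become self-closing
        } else {
            // guess 3: a new <p> or <li> ends an unclosed one directly above it
            if ((tag == "p" || tag == "li") && !open.empty() && open.back() == tag) {
                out += "</" + tag + ">";
                open.pop_back();
            }
            out += "<" + tag + ">";
            open.push_back(tag);
        }
    }

    // Close whatever is still open at end of input, innermost first.
    while (!open.empty()) {
        out += "</" + open.back() + ">";
        open.pop_back();
    }
    return out;
}
```

With these rules, `repair_tag_soup("<p>one<p>two<b>bold</p><br>")` returns `<p>one</p><p>two<b>bold</b></p><br/>`. A parser that instead kept the `<b>` open across the paragraph boundary would produce a different but equally defensible tree, which is the kind of divergence described above.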

-----Original Message-----
From: news on behalf of Piotr Dobrogost
Sent: Mon 5/4/2009 8:42 PM
To: [email protected]
Subject:  HTML parsing
 
Hi

I'd like to use Xerces to parse HTML.
As HTML is not XML, I need to tweak Xerces so that it can transform
HTML into valid XML.
I found information about NekoHTML which is just what I need but it's in
Java...
Do you know if there's something like NekoHTML written in C/C++?
If you know a better tool for this job, then please let me know.

Thank you in advance for your time and help.

ps

I was very surprised by how little information I found on the topic of
parsing HTML with C++ on the Internet.
I was even more surprised by how little information on this topic I
found on this list.
Is there any reason for this?
How is this possible while so many browsers are written in C++?

-- 
Piotr Dobrogost
*** curlpp.org - c++ wrapper for libcurl ***
