mod_pagespeed's event-driven HTML parser is open source, and is written in C++: http://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/htmlparse/public/html_parse.h
This parser is tested using HTML from a large number of web sites. The build process for this module (http://code.google.com/p/modpagespeed/wiki/HowToBuild) generates a separate .a for the HTML parser, although it has a few dependencies that would need to be linked in. These are all included in mod_pagespeed.so, which is self-contained but larger.

If there is enough interest, we could try to package up a self-contained library that would be easier to call from other modules.

See also libxml2, which has an HTML mode.

-Josh

On Fri, Mar 25, 2011 at 9:28 AM, MK <m...@cognitivedissonance.ca> wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut Jia <whut_...@163.com> wrote:
> > Hi,all
> > I want to parse a html content and withdraw some element in myself
> > apache handler.Please ask how to do it. Thanks,
> > Jia
>
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.
>
> However, I wrote a quick and simple, event driven parser library a few
> months ago -- I have been meaning to open source this on CCAN or
> somewhere but have not gotten around to it, so if you are interested
> you can send me a message directly, I have some basic scraper demos
> etc. It is not on the scale of libwww -- it is just a low level HTML
> parser -- but I am sure it could do what you want, and you can either
> compile it in or link to with an apache module (it has no further
> dependencies).
>
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
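
For anyone unfamiliar with the event-driven style mentioned in both replies, here is a rough C++ sketch of the idea: the scanner walks the input once and fires a callback for each start tag, end tag, and text run, and the caller decides what to keep. Every name in it (TagEvents, ScanHtml, ExtractText) is made up for illustration. It is not the mod_pagespeed API, not libxml2, and not MK's library, and it is deliberately naive: it ignores attributes, comments, and scripts, and will be confused by a '>' inside a quoted attribute value. A real parser handles all of those.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback set for this sketch only.
struct TagEvents {
  std::function<void(const std::string&)> on_start;  // start tag, lowercased name
  std::function<void(const std::string&)> on_end;    // end tag, lowercased name
  std::function<void(const std::string&)> on_text;   // text between tags
};

// Walks the input once and fires one callback per tag or text run.
void ScanHtml(const std::string& html, const TagEvents& ev) {
  std::size_t pos = 0;
  while (pos < html.size()) {
    if (html[pos] != '<') {
      // Text run: everything up to the next '<' (or end of input).
      std::size_t lt = html.find('<', pos);
      if (lt == std::string::npos) lt = html.size();
      if (ev.on_text) ev.on_text(html.substr(pos, lt - pos));
      pos = lt;
      continue;
    }
    const std::size_t gt = html.find('>', pos);
    if (gt == std::string::npos) break;  // unterminated tag: stop scanning
    // gt > pos here, so html[pos + 1] is always a valid index.
    const bool closing = (html[pos + 1] == '/');
    std::size_t i = pos + (closing ? 2 : 1);
    std::string name;
    while (i < gt && std::isalnum(static_cast<unsigned char>(html[i]))) {
      name += static_cast<char>(std::tolower(static_cast<unsigned char>(html[i])));
      ++i;
    }
    // Tags with no alphanumeric name (e.g. <!DOCTYPE ...>) are skipped.
    if (!name.empty()) {
      if (closing) {
        if (ev.on_end) ev.on_end(name);
      } else {
        if (ev.on_start) ev.on_start(name);
      }
    }
    pos = gt + 1;
  }
}

// Example handler: collect the text inside each outermost <wanted> element.
std::vector<std::string> ExtractText(const std::string& html,
                                     const std::string& wanted) {
  std::vector<std::string> out;
  int depth = 0;  // how many <wanted> elements are currently open
  TagEvents ev;
  ev.on_start = [&](const std::string& name) {
    if (name != wanted) return;
    if (depth == 0) out.emplace_back();  // new outermost match
    ++depth;
  };
  ev.on_end = [&](const std::string& name) {
    if (name == wanted && depth > 0) --depth;
  };
  ev.on_text = [&](const std::string& text) {
    if (depth > 0) {
      assert(!out.empty());  // depth > 0 implies an entry was pushed
      out.back() += text;
    }
  };
  ScanHtml(html, ev);
  return out;
}
```

Inside a handler, something like ExtractText(body, "h1") would then give you the text of each outermost h1 element in the response body. For anything beyond a toy, one of the real parsers mentioned above is the better choice.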