On Sat, 5 Mar 2011 03:01:06 +0530
Kasun Gajasinghe <[email protected]> wrote:

> There are some tools out there that parse dirty HTML tags and retrieve
> the whole content. But a lot of the good tools don't have a license
> compatible with DocBook. HtmlCleaner looks like a good solution for
> adding support for indexing/searching *html* files, though. So
> full support for HTML should come!


TagSoup, from John Cowan, is my tool of choice for this.
http://ccil.org/~cowan/XML/tagsoup/
I even have a version which I use as the input parser for Saxon,
which lets me process HTML as XML, using full XPath.
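
A minimal sketch of that idea, for anyone who wants to try it. It is not
the setup described above: it uses the JDK's built-in JAXP identity
transform and XPath 1.0 rather than Saxon, and the class name HtmlXPath
and its two methods are hypothetical names made up for illustration. It
assumes the TagSoup jar is on the classpath and loads
org.ccil.cowan.tagsoup.Parser by reflection so the file still compiles
without it. As far as I know TagSoup reports elements in the XHTML
namespace by default, so the XPath matches on local-name().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class HtmlXPath {

    // Parse the markup with the given SAX reader, build a DOM through an
    // identity transform, and return the string value of each selected node.
    static List<String> select(XMLReader reader, String markup, String xpath)
            throws Exception {
        SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(markup)));
        DOMResult dom = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(source, dom);

        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, dom.getNode(), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    // Assumption: TagSoup's SAX parser class is on the classpath.
    // Loaded reflectively so there is no compile-time dependency on the jar.
    static XMLReader tagSoupReader() throws Exception {
        return (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Deliberately sloppy HTML: unquoted attribute, unclosed elements.
        String dirty = "<html><body><p>Unclosed paragraph"
                + "<a href=x.html>link<br></body>";
        try {
            List<String> hrefs =
                    select(tagSoupReader(), dirty, "//*[local-name()='a']/@href");
            System.out.println(hrefs);
        } catch (ClassNotFoundException e) {
            System.out.println("TagSoup jar not found on the classpath");
        }
    }
}
```

Since select() takes any XMLReader, the same code can be pointed at a
strict XML parser for comparison. To feed Saxon instead, the SAXSource
built here would be handed to Saxon's own transformer.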

HTH

regards

-- 
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk

