On Sat, Mar 5, 2011 at 2:06 PM, Dave Pawson <[email protected]> wrote:
> On Sat, 5 Mar 2011 03:01:06 +0530
> Kasun Gajasinghe <[email protected]> wrote:
>
> > There are some tools out there that parse dirty HTML tags and
> > retrieve their whole content, but many of the good ones don't have
> > a license compatible with DocBook. HtmlCleaner looks like a good
> > option for adding indexing/searching support for *html* files,
> > though. So full support for HTML may come!
>
> tagsoup from John Cowan is my tool of choice for this.
> http://ccil.org/~cowan/XML/tagsoup/
> I even have a version which I use as the parser for input to Saxon,
> which lets me process HTML as XML, using full XPath.

That's great, Dave. Thanks. I had previously looked at TagSoup at
http://java-source.net/open-source/html-parsers/tagsoup, where the
license is listed as GPL, so I backed off. I didn't know that it is
Apache licensed now. The indexer (htmlsearch) in WebHelp is SAX-based,
so TagSoup should fit perfectly because it is SAX-compliant. This will
be a nice feature addition for WebHelp, IMO.

--Kasun

-- 
~~~*******'''''''''''''*******~~~
Kasun Gajasinghe,
University of Moratuwa,
Sri Lanka.

Blog: http://kasunbg.blogspot.com
Twitter: http://twitter.com/kasunbg
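[Editor's note: a minimal sketch of the pattern discussed above. TagSoup's `org.ccil.cowan.tagsoup.Parser` implements the standard `org.xml.sax.XMLReader` interface, so it can be dropped into any SAX pipeline such as WebHelp's indexer or a Saxon `SAXSource`. Since TagSoup is an external jar, this sketch uses the JDK's built-in SAX parser as a stand-in (so the input here must be well-formed); the `TextCollector` handler and `extractText` helper are illustrative names, not part of any existing API.]

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// A SAX ContentHandler that accumulates text content, the way a
// search indexer would gather terms from a document.
public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    // Parse the given markup and return its concatenated text content.
    // To handle dirty HTML, the XMLReader below would instead be
    // `new org.ccil.cowan.tagsoup.Parser()` (assumption: TagSoup jar
    // on the classpath); everything else stays the same.
    public static String extractText(String markup) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.getText();
    }
}
```

Because the only coupling point is the `XMLReader` interface, swapping the JDK parser for TagSoup requires no change to the handler code, which is why it should slot into the SAX-based htmlsearch indexer cleanly.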
