On Sat, Mar 5, 2011 at 2:06 PM, Dave Pawson <[email protected]> wrote:

> On Sat, 5 Mar 2011 03:01:06 +0530
> Kasun Gajasinghe <[email protected]> wrote:
>
> > There are some tools out there that can parse dirty HTML tags and
> > retrieve the whole content, but a lot of the good tools don't have a
> > license compatible with DocBook. HtmlCleaner looks like a good option
> > for adding indexing/searching support for *html* files, though. So,
> > full support for HTML could come!
>
>
> TagSoup from John Cowan is my tool of choice for this.
> http://ccil.org/~cowan/XML/tagsoup/
> I even have a version that I use as the input parser for Saxon, which
> lets me process HTML as XML using full XPath.
>

That's great, Dave. Thanks! Previously, I had looked at TagSoup at
http://java-source.net/open-source/html-parsers/tagsoup , where the license
is listed as GPL, so I backed off. I didn't know that it's Apache-licensed
now. The indexer (htmlsearch) in WebHelp is SAX-based, so TagSoup would fit
perfectly because it's SAX-compliant. This will be a nice feature addition
for WebHelp, IMO.
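To sketch what "SAX-compliant" buys here: TagSoup's `org.ccil.cowan.tagsoup.Parser` implements the standard `org.xml.sax.XMLReader` interface, so it can stand in for the stock SAX parser wherever the indexer expects one. A minimal sketch, assuming tagsoup.jar is on the classpath (the `TagSoupDemo` class and text-collecting handler below are illustrative, not part of WebHelp):

```java
import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupDemo {

    // Pull the text content out of (possibly dirty) HTML, the way a
    // SAX-based indexer would: collect the characters() events.
    static String extractText(String html) throws Exception {
        XMLReader reader = new Parser();  // TagSoup's lenient HTML reader
        final StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Unclosed tags and missing end tags are repaired on the fly.
        System.out.println(extractText("<html><body><p>dirty <b>html"));
    }
}
```

The same `XMLReader` can be wrapped in a `javax.xml.transform.sax.SAXSource` and handed to Saxon, which is presumably how Dave runs XPath over HTML.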

--Kasun

-- 
~~~*******'''''''''''''*******~~~
Kasun Gajasinghe,
University of Moratuwa,
Sri Lanka.
Blog: http://kasunbg.blogspot.com
Twitter: http://twitter.com/kasunbg
