Re: HTML Parsing problems...

Andrzej Bialecki Mon, 22 Sep 2003 05:41:24 -0700

Michael Giles wrote:

Erik,

Probably a good idea to swap something else in, although Neko introduces a dependency on Xerces. I didn't play with Neko because I am currently using a different XML parser and didn't want to deal with the conflicts (and also find dependencies on specific parsers annoying). However, yesterday I downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great! It is small and fast and so far has parsed every page I've thrown at it. I wrote a SAX ContentHandler that only grabs the text and does a few other little things (like inserting spaces, removing tabs/line feeds, grabbing title) and it seems to be a perfect fit for the job. It requires the SAX framework, but is parser independent. The only tweak I made to the TagSoup code was to add an "else" to deal with a bug where it was consuming ";" after entities that it did not deal with.
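
For readers who want to try this approach, here is a minimal sketch of a text-only SAX ContentHandler of the kind described above. It is not Michael's actual code: the class name TextExtractor, its methods, and the choice to keep the title separate from the body text are all assumptions made for illustration. So that the example runs without extra jars, it uses the JDK's built-in SAX parser, which only accepts well-formed markup. For real-world HTML you would presumably pass in a lenient XMLReader instead (TagSoup's org.ccil.cowan.tagsoup.Parser should fit here, though that is not exercised in this sketch).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the page title and the remaining text. */
class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
        }
        // Insert a space at every tag boundary so adjacent elements do not run together.
        text.append(' ');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
        }
        text.append(' ');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Title text goes to its own buffer (an assumption); everything else is body text.
        (inTitle ? title : text).append(ch, start, length);
    }

    /** Body text with tabs, line feeds and repeated spaces collapsed to single spaces. */
    String getText() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    String getTitle() {
        return title.toString().replaceAll("\\s+", " ").trim();
    }

    /** Runs the given reader over the markup; the handler itself is parser independent. */
    static TextExtractor extract(XMLReader reader, String markup) throws Exception {
        TextExtractor handler = new TextExtractor();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // JDK parser used here only so the sketch runs stand-alone; it needs well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextExtractor result = extract(reader,
            "<html><head><title>Hello\tWorld</title></head>"
            + "<body><p>First\nline</p><p>Second</p></body></html>");
        System.out.println(result.getTitle());
        System.out.println(result.getText());
    }
}
```

Because the handler only depends on the SAX interfaces, switching parsers should only mean changing the XMLReader passed to extract().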

TagSoup is great - however, it is neither maintained nor actively developed (the same could be said of JTidy, but TagSoup's history is much shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) for my application, and it also works very well, even for ill-formed input. It's also very actively developed.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


