It's probably a good idea to swap something else in, although Neko introduces a dependency on Xerces. I didn't try Neko because I'm currently using a different XML parser and didn't want to deal with the conflicts (I also find dependencies on specific parsers annoying). However, yesterday I downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great! It is small and fast, and so far it has parsed every page I've thrown at it. I wrote a SAX ContentHandler that grabs only the text and does a few other small things (inserting spaces, removing tabs and line feeds, grabbing the title), and it seems to be a perfect fit for the job. It requires the SAX framework but is parser independent. The only tweak I made to the TagSoup code was to add an "else" to fix a bug where it consumed the ";" after entities that it did not handle.
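For anyone who wants the general shape of a text-only handler, here is a minimal sketch. It is not the actual handler described above. The class name TextOnlyHandler and its methods are made up for illustration, and the spacing rule (one space at every element boundary, then collapsing runs of whitespace) is an assumption about what "inserting spaces" and "removing tabs/line feeds" might look like. It depends only on the SAX interfaces, so any XMLReader can drive it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects body text and the <title> separately.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
        } else {
            text.append(' '); // keep text of adjacent elements from running together
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
        } else {
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Title text is kept out of the body text.
        (inTitle ? title : text).append(ch, start, length);
    }

    // Collapse tabs, line feeds and repeated spaces into single spaces.
    private static String normalize(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }

    String getTitle() { return normalize(title); }

    String getText() { return normalize(text); }

    // Parser-independent entry point: pass in any SAX XMLReader.
    static TextOnlyHandler parse(XMLReader reader, String markup) {
        try {
            TextOnlyHandler handler = new TextOnlyHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience for well-formed input, using the JDK's built-in SAX parser.
    static TextOnlyHandler parse(String wellFormedMarkup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return parse(reader, wellFormedMarkup);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The single-argument parse uses the JDK parser, so it only copes with well-formed markup. For real-world HTML you would pass a TagSoup reader (org.ccil.cowan.tagsoup.Parser, which implements XMLReader) to the two-argument parse instead. The handler itself does not change.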
If Neko is potentially headed into the Apache fold, using it probably makes sense. But if you'd like to try out my TagSoup ContentHandler, just let me know.
-Mike
At 08:08 PM 9/19/2003 -0400, you wrote:
I'm going to swap in the neko HTML parser for the demo refactorings I'm doing. I would be all for replacing the demo HTML parser with this.
If you look at the Ant <index> task in the sandbox, you'll see that I used JTidy for it and it works well, but I've heard that neko is faster and better so I'll give it a try.
Erik
