Re: which HTML parser is better?

aurora Thu, 03 Feb 2005 08:30:36 -0800

For all parser suggestion I think there is one important attribute. Some parsers returns data provide that the input HTML is sensible. Some parsers is designed to be most flexible as tolerant as it can be. If the input is clean and controlled the former class is sufficient. Even some regular expression may be sufficient. (I that's the original poster wants). If you are building a web crawler you need something really tolerant.

Once I have prototyped a nice and fast parser. Later I have to abandon it because it failed to parse about 15% documents (problem handling nested quotes like onclick="alert('hi')").

No one has yet mentioned using ParserDelegator and ParserCallback that are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered.
On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: which HTML parser is better?

Reply via email to