I once prototyped a nice, fast parser, but later had to abandon it because it failed to parse about 15% of the documents (it had problems handling nested quotes such as onclick="alert('hi')").
No one has yet mentioned using ParserDelegator and HTMLEditorKit.ParserCallback, which ship with Swing's HTML support. I have been using these classes successfully to parse the text out of HTML files. You just need to extend HTMLEditorKit.ParserCallback and override the methods that are called when the parser encounters text and the different kinds of tags.
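To make that concrete, here is a minimal sketch of the approach, not a tuned implementation. The class name HtmlTextExtractor, the nested TextCallback, and the extractText helper are names made up for this example; only ParserDelegator, HTMLEditorKit.ParserCallback, and handleText come from the JDK. It overrides handleText alone and ignores tag callbacks, which is enough to collect plain text for indexing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class for this example; not part of Lucene or the JDK.
class HtmlTextExtractor {

    // Collects every text run the Swing parser reports.
    static class TextCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void handleText(char[] data, int pos) {
            // Separate consecutive text runs with a single space.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }

        String getText() {
            return text.toString();
        }
    }

    // Parses the given HTML string and returns the text content found in it.
    static String extractText(String html) {
        TextCallback callback = new TextCallback();
        try (Reader reader = new StringReader(html)) {
            // The third argument (ignoreCharSet) tells the parser to ignore
            // charset declarations in the document; we already have characters.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.getText();
    }

    public static void main(String[] args) {
        String html = "<html><body><p onclick=\"alert('hi')\">Hello</p>"
                + "<p>world</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To react to tags as well (for example, to skip certain elements), the same callback can also override handleStartTag, handleEndTag, and handleSimpleTag.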
On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter out the tags that are auto-created by MS Word's 'Save As HTML' function?