I think that depends on what you want to do. The Lucene demo parser does a simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (it uses the same API and will likely become part of Xerces), so it maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at the API level, so I don't know it as well. Based on its UI, it appears to focus primarily on HTML validation and error detection/correction.
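To make the "full DOM that you can manipulate" point concrete, here is a rough sketch of the kind of cleanup the original question asks about, written against the standard org.w3c.dom API only. Several things in it are my assumptions rather than anything from CyberNeko or the FAQ: the class and method names (DomCleanupSketch, stripOfficeElements, cleanedText) are made up for illustration, the "o:" and "w:" prefixes are just a guess at what Word-generated tags look like, and the JDK's built-in XML parser stands in for an HTML parser, so the input must be well-formed. With CyberNeko you would obtain the Document from its parser instead and run the same tree walk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomCleanupSketch {

    /** Parses well-formed markup with the JDK parser (a stand-in for an HTML-tolerant parser). */
    static Document parse(String xhtml) {
        try {
            // Namespace awareness is off by default, so undeclared prefixes like o: are accepted.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    /** Assumption: Word-generated elements are recognizable by an o: or w: prefix. */
    static boolean isOfficeTag(String name) {
        String lower = name.toLowerCase();
        return lower.startsWith("o:") || lower.startsWith("w:");
    }

    /** Removes matching elements (and everything inside them) from the subtree under node. */
    static void stripOfficeElements(Node node) {
        // Snapshot the children first: removing from a live NodeList while iterating skips nodes.
        NodeList children = node.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() == Node.ELEMENT_NODE && isOfficeTag(child.getNodeName())) {
                node.removeChild(child);
            } else {
                stripOfficeElements(child);
            }
        }
    }

    /** Returns the whitespace-normalized text left after the cleanup, e.g. for indexing. */
    static String cleanedText(String xhtml) {
        Document doc = parse(xhtml);
        stripOfficeElements(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "<html><body><p>Hello <o:p>junk</o:p>world</p></body></html>";
        System.out.println(cleanedText(sample)); // prints "Hello world"
    }
}
```

Note that this drops the contents of the removed elements as well. If you wanted to keep the text inside them, you would move the children up to the parent before removing the element. The resulting string is what you would then put into a Lucene field.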
I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and I really like it. It has been robust for me so far.

Chuck

> -----Original Message-----
> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: lucene-user@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application demo, CyberNeko HTML
> Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> best? Can it filter tags that are auto-created by MS-Word's
> 'Save As HTML files' function?

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]