Look in the Lucene sandbox in CVS. I contributed an Ant task that indexes HTML documents. It uses JTidy under the covers to parse HTML into title and body content, and it could be extended to pull other information such as <meta> keywords.
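
For illustration only, here is a rough sketch of the "split HTML into title and body" step. It is not the Ant task's code and it does not use JTidy; it swaps in the JDK's Swing HTML parser (the one Leo mentions below) so it runs with nothing but the JDK. The class name HtmlTitleBody and its fields are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene or the sandbox Ant task):
// collects the <title> text and all remaining text of an HTML page.
public class HtmlTitleBody {
    public final String title;
    public final String body;

    private HtmlTitleBody(String title, String body) {
        this.title = title;
        this.body = body;
    }

    public static HtmlTitleBody parse(Reader in) throws IOException {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside <title> goes to the title, everything else to the body.
                StringBuilder target = inTitle ? title : body;
                if (target.length() > 0) target.append(' ');
                target.append(data);
            }
        };

        // Third argument: ignore any charset declared in the page,
        // since the caller already supplies decoded characters.
        new ParserDelegator().parse(in, callback, true);
        return new HtmlTitleBody(title.toString(), body.toString());
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Hello</title></head>"
                    + "<body><p>World of Lucene</p></body></html>";
        HtmlTitleBody doc = parse(new StringReader(html));
        System.out.println("title: " + doc.title);
        System.out.println("body:  " + doc.body);
    }
}
```

The two strings would then go into separate title and body fields of a Lucene document. Pulling <meta> keywords would presumably mean also inspecting the attributes passed to the callback, which this sketch leaves out.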

Erik


Leo Galambos wrote:
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) Swing's HTML parser

In case (1) I could process about 300K HTML documents; in case (2), more than 400K.

But I cannot process the complete collection (5M) and finish my hard stress tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :) If you think that you have one, please let me know. I wanted to try Neko, but it looks complicated and I do not want to affect the results by using a "robust" parser.

THX

-g-


--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>



