I was considering not using Nutch for indexing web documents. I was thinking of either indexing the full HTML document, or using some kind of web-scraper/HTML-parser utility to extract only the text content from a web page and then indexing that.
I know it sounds strange, but I feel I have more control over what gets indexed if I use Lucene directly. For example, I can add extra fields, and I can guarantee that whatever gets indexed will be searchable. Is this a bad approach, or should I just use Nutch?

-- Berlin Brown [berlin dot brown at gmail dot com] http://botspiritcompany.com/botlist/?
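For what it's worth, the "extract only the text content" step can be done with plain Java before Lucene ever sees the page. The sketch below is a deliberately naive tag-stripper (the class name `HtmlTextExtractor` is just illustrative); for real pages a proper HTML parser such as Jsoup, or the parsers Nutch ships with, would be much more robust:

```java
// Minimal sketch: reduce an HTML page to plain text before indexing it.
// NOTE: regex-based tag stripping is a simplification for illustration only;
// it will mishandle malformed HTML, CDATA, comments, and most entities.
public class HtmlTextExtractor {

    // Strip scripts, styles, and all remaining tags, then collapse whitespace.
    public static String extractText(String html) {
        return html
            .replaceAll("(?is)<script.*?</script>", " ") // drop inline JS
            .replaceAll("(?is)<style.*?</style>", " ")   // drop inline CSS
            .replaceAll("(?s)<[^>]+>", " ")              // drop remaining tags
            .replaceAll("&nbsp;", " ")                   // one common entity
            .replaceAll("\\s+", " ")                     // collapse whitespace
            .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(extractText(html)); // prints "Hello world"
    }
}
```

The extracted string would then become one field of a Lucene `Document`, alongside whatever custom fields you want (URL, title, fetch date, etc.), and be written with an `IndexWriter` — which is exactly the extra control over the schema described above.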