I am considering not using Nutch for indexing web documents.  Instead, I
would either index the full HTML document as-is, or use some kind of web
scraper / HTML parser utility to extract only the text content from a web
page and then index that.
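To illustrate the second option, here is a minimal sketch of stripping an
HTML page down to its text before indexing.  This uses only regexes from the
standard library for brevity; a real HTML parser (e.g. TagSoup or NekoHTML)
handles malformed markup and entities much more robustly, and the class and
method names here are just illustrative.

```java
// Illustrative sketch: reduce an HTML page to plain text for indexing.
// Note: regex-based stripping is fragile on real-world HTML; a proper
// parser is preferable.  HTML entities (&amp; etc.) are not decoded here.
public class HtmlTextExtractor {

    public static String extractText(String html) {
        // Drop script and style blocks entirely, including their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Remove all remaining tags.
        String noTags = noScripts.replaceAll("<[^>]+>", " ");
        // Collapse runs of whitespace into single spaces.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                    + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html)); // prints: Hello world
    }
}
```

The extracted string would then become the body field of a Lucene Document,
alongside whatever extra fields (URL, title, fetch date) you want to add.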

I know it sounds strange, but I feel I have more control over what gets
indexed if I use Lucene directly.  For example, I can add extra fields, and I
can guarantee that whatever gets indexed will be searchable.

Is this a bad approach, or should I just use Nutch?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?

