I was considering not using Nutch for indexing web documents. I was thinking of either indexing the full HTML document, or using some kind of web-scraper/HTML-parser utility to extract only the text content from a web page and then indexing that.
I know it sounds strange, but I feel I have more control over what gets indexed if I use Lucene directly. For example, I can add extra fields, and I can guarantee that whatever gets indexed will be searchable. Is this a bad approach, or should I just use Nutch?

-- Berlin Brown [berlin dot brown at gmail dot com] http://botspiritcompany.com/botlist/?
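For what it's worth, the "extract only the text content" step can be done with plain Java before Lucene ever sees the page. The sketch below is a deliberately naive tag-stripper (the class name `HtmlTextExtractor` is just illustrative); for real pages a proper HTML parser such as Jsoup, or the parsers Nutch ships with, would be much more robust:

```java
// Minimal sketch: reduce an HTML page to plain text before indexing it.
// NOTE: regex-based tag stripping is a simplification for illustration only;
// it will mishandle malformed HTML, CDATA, comments, and most entities.
public class HtmlTextExtractor {

    // Strip scripts, styles, and all remaining tags, then collapse whitespace.
    public static String extractText(String html) {
        return html
            .replaceAll("(?is)<script.*?</script>", " ") // drop inline JS
            .replaceAll("(?is)<style.*?</style>", " ")   // drop inline CSS
            .replaceAll("(?s)<[^>]+>", " ")              // drop remaining tags
            .replaceAll("&nbsp;", " ")                   // one common entity
            .replaceAll("\\s+", " ")                     // collapse whitespace
            .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(extractText(html)); // prints "Hello world"
    }
}
```

The extracted string would then become one field of a Lucene `Document`, alongside whatever custom fields you want (URL, title, fetch date, etc.), and be written with an `IndexWriter` — which is exactly the extra control over the schema described above.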