Hi,

I have an existing Lucene application that takes web pages, transmutes them into a form not recognizable as any natural language, and then searches through that form. In particular, we built our own Lucene analyzer/tokenizer that can read our funky format. To take advantage of the Nutch crawler and Hadoop parallelization, we'd like to port this application over to Nutch.

My plan is to use the parser plugins that come with Nutch, then write an indexer plugin that takes the text recovered by the parsers, sends it to an external process to produce the format we need for indexing, replaces the existing text in the document with the converted text, and indexes that. I believe I'd also need an analyzer plugin that reuses the code we already have to tokenize our weird format. Once that's done, I should end up with an index I can use with our existing Lucene-based search code, without necessarily converting the search side over to run via Nutch.
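
Concretely, here's a rough sketch of the indexer plugin I have in mind, written against the 0.9-era IndexingFilter extension point (the exact interface varies across Nutch versions, so take the signatures with a grain of salt). The transmute() method is just a stand-in for the call out to our external converter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class TransmutingIndexingFilter implements IndexingFilter {
      private Configuration conf;

      public Document filter(Document doc, Parse parse, Text url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        // Plain text as recovered by whichever parse plugin handled the page.
        String plain = parse.getText();

        // Hand it to the external process that produces our indexing format.
        String transmuted = transmute(plain);

        // Swap the transmuted text in for the original content field,
        // so only our format ends up in the index.
        doc.removeFields("content");
        doc.add(new Field("content", transmuted,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
      }

      // Stand-in for invoking the external converter
      // (Runtime.exec, a socket, whatever ends up fitting).
      private String transmute(String text) {
        return text;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

And the analyzer plugin should amount to a thin shim around the analyzer we already have, assuming the NutchAnalyzer extension point (ExistingFunkyAnalyzer is a made-up name for our current analyzer):

    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.nutch.analysis.NutchAnalyzer;

    public class FunkyFormatAnalyzer extends NutchAnalyzer {
      // Delegate to the tokenizer we already use for our format.
      private final ExistingFunkyAnalyzer delegate = new ExistingFunkyAnalyzer();

      public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
      }
    }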

I have this nagging feeling that I'm going to be violating some deep-seated assumptions if I do this, so I'd appreciate any advice I could get.

Thanks,

Dave
