Hi,

I have an existing Lucene application that takes web pages, transmutes them into a form not recognizable as any natural language, and then searches through that form. In particular, we built our own Lucene analyzer/tokenizer that can read our funky format. To take advantage of the Nutch crawler and Hadoop parallelization, we'd like to port this application over to Nutch.

My plan is to use the parser plugins that come with Nutch, then write an indexer plugin that takes the text recovered by the parsers, sends it to an external process to produce the format we need for indexing, replaces the existing text in the document with the converted text, and indexes that. I believe I'd also need an analyzer plugin that reuses the code we already have to tokenize our weird format. Once that's done, I should end up with an index I can use with our existing Lucene-based search code, without necessarily converting the search side over to run via Nutch.
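
Concretely, here's a rough sketch of the indexer plugin I have in mind, written against the 0.9-era IndexingFilter extension point (the exact interface varies across Nutch versions, so take the signatures with a grain of salt). The transmute() method is just a stand-in for the call out to our external converter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class TransmutingIndexingFilter implements IndexingFilter {
      private Configuration conf;

      public Document filter(Document doc, Parse parse, Text url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        // Plain text as recovered by whichever parse plugin handled the page.
        String plain = parse.getText();

        // Hand it to the external process that produces our indexing format.
        String transmuted = transmute(plain);

        // Swap the transmuted text in for the original content field,
        // so only our format ends up in the index.
        doc.removeFields("content");
        doc.add(new Field("content", transmuted,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
      }

      // Stand-in for invoking the external converter
      // (Runtime.exec, a socket, whatever ends up fitting).
      private String transmute(String text) {
        return text;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

And the analyzer plugin should amount to a thin shim around the analyzer we already have, assuming the NutchAnalyzer extension point (ExistingFunkyAnalyzer is a made-up name for our current analyzer):

    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.nutch.analysis.NutchAnalyzer;

    public class FunkyFormatAnalyzer extends NutchAnalyzer {
      // Delegate to the tokenizer we already use for our format.
      private final ExistingFunkyAnalyzer delegate = new ExistingFunkyAnalyzer();

      public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
      }
    }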

I have this nagging feeling that I'm going to be violating some deep-seated assumptions if I do this, so I'd appreciate any advice I could get.

Thanks,

Dave
