Hi,

Nutch indexes the documents in the org.apache.nutch.indexer.Indexer class. In the reduce phase, the documents are output wrapped in ObjectWritable. The OutputFormat opens a Lucene IndexWriter on a local directory (via FileSystem.startLocalOutput()), adds all of the collected documents, and then puts the finished index into DFS (FileSystem.completeLocalOutput()). The resulting index has one partition per reduce task.
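To illustrate the pattern (not the actual Nutch code): the index is built on the reducer's local disk, and only the finished files are moved into the shared output directory. Here is a minimal stand-in using plain java.nio instead of Hadoop's FileSystem; the directory names and the "segments" file are made up for the example, and the two copy steps correspond roughly to startLocalOutput()/completeLocalOutput().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalThenCopy {
    public static void main(String[] args) throws IOException {
        // startLocalOutput(): a scratch directory on the reducer's local disk.
        Path localTmp = Files.createTempDirectory("index-local");
        // Stand-in for the DFS output path of this reduce partition.
        Path dfsOut = Files.createTempDirectory("dfs-out");

        // "Index" the collected documents locally. In Nutch this is where
        // Lucene's IndexWriter writes its segment files.
        Path segment = localTmp.resolve("segments");
        Files.write(segment, "doc1\ndoc2\n".getBytes());

        // completeLocalOutput(): copy the finished index into DFS.
        Files.copy(segment, dfsOut.resolve("segments"),
                StandardCopyOption.REPLACE_EXISTING);

        System.out.println(Files.exists(dfsOut.resolve("segments")));
        // prints "true"
    }
}
```

The point is that Lucene only ever sees a normal local filesystem; DFS is involved only after the index is complete.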

Eugeny N Dzhurinsky wrote:
Hello!

We would like to use Hadoop to index a lot of documents, and we would like to
have this index in Lucene and utilize Lucene's search engine power for
searching.

At this point I am confused a bit - when we analyze documents in the Map
part, we will end up with:
- document name/location
- a list of name/value pairs to be indexed by Lucene somehow

As far as I know I can write the same key with different values to the
OutputCollector, but I'm not sure how I can pass a list of name/value pairs
to the collector - or do I need to think about this in a different way?
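One common approach is to emit a single composite value per document key rather than many separate records - in Hadoop the value could be a MapWritable or a custom Writable holding the field list. A minimal sketch of the idea with plain Java collections (the document path and field names below are made up):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CollectFields {
    public static void main(String[] args) {
        // One output record per document: key = doc location,
        // value = all the name/value pairs Lucene should index.
        Map<String, Map<String, String>> collector = new HashMap<>();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Hadoop indexing");
        fields.put("body", "Build Lucene indexes with MapReduce");

        // Stands in for output.collect(key, value) in a Mapper.
        collector.put("/docs/a.txt", fields);

        System.out.println(collector.get("/docs/a.txt").get("title"));
        // prints "Hadoop indexing"
    }
}
```

The reducer then receives the whole field list for a document in one value and can add it to the index as a single Lucene Document.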

Another question is how I can write a Lucene index in the reduce part, since as far
as I know reduce can be invoked on any computer in the cluster, while a Lucene index
requires a non-DFS filesystem to store its index and helper files?

I heard that Nutch can use Map/Reduce to index documents and store them in a
Lucene index, but a quick look at its sources didn't give me a solid view of
how it does this, or whether it does it the way I described at all.

Probably I'm missing something, so could somebody please point me in the right
direction?

Thank you in advance!
