Hi,

I currently have an indexing and search service that receives documents to be
indexed, and search requests, through XML-RPC. The server uses the Lucene
search engine and stores its index on the local file system. The documents
to be indexed are usually academic papers, so I'm not trying to index
large-scale data sets in a single job, although the indexes may grow very
large as more documents are received.

Now we are trying to parallelize search and indexing by distributing the
index into shards on a cluster. I've been studying Hadoop, but it's not yet
clear to me how to implement the system.

The basic design is:
1. We receive a document through XML-RPC, and it should be indexed into one
shard on the cluster.
2. We receive a query request through XML-RPC; the query must be executed
over all shards, and the hits should be returned in an XML response.


My initial idea is:
1. Indexing
- The received document is used as the input of a single map. This function
would index the document into the local shard using our custom library built
on top of Lucene. There is no reduce step.
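One detail I'd have to decide either way is how a received document gets assigned to a shard. A minimal sketch in plain Java (no Hadoop; `NUM_SHARDS` and `shardFor` are names I made up for illustration) of simple hash-based routing, which keeps a given document id on the same shard:

```java
public class ShardRouter {
    // Number of index shards in the cluster (assumed value for illustration).
    static final int NUM_SHARDS = 4;

    // Route a document to a shard by hashing its identifier, so the same
    // document id always lands on the same shard.
    static int shardFor(String docId) {
        // Math.floorMod keeps the result non-negative even when hashCode()
        // returns a negative value.
        return Math.floorMod(docId.hashCode(), NUM_SHARDS);
    }

    public static void main(String[] args) {
        System.out.println("paper-123 -> shard " + shardFor("paper-123"));
    }
}
```

Routing by hash balances shards without any central bookkeeping; the trade-off is that changing `NUM_SHARDS` later remaps documents, so a lookup table per document may be preferable if shards will be added over time.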

2. Search
- The received query is used as the input of the map function. This function
would search the local shard using our custom library and emit the hits. The
reduce function would group the hits from all shards.
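The grouping I have in mind for the reduce step would look roughly like this sketch in plain Java (the `Hit` class and `merge` method are illustrative names, not from our real library): collect the per-shard hit lists for a query and keep the top-k by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HitMerger {
    // A scored hit as emitted by one shard (simplified for illustration).
    static class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    // What the reduce step would do: flatten the hits emitted by every
    // shard for the same query and return the top-k by descending score.
    static List<Hit> merge(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits) all.addAll(hits);
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) {
        List<Hit> shard0 = Arrays.asList(new Hit("d1", 0.9f), new Hit("d2", 0.4f));
        List<Hit> shard1 = Arrays.asList(new Hit("d3", 0.7f));
        for (Hit h : merge(Arrays.asList(shard0, shard1), 2)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

One caveat I'm aware of: Lucene scores computed on separate shards are only roughly comparable unless global term statistics are shared across shards, so a plain score sort like this is an approximation.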


Is it possible to implement this using the Hadoop MapReduce framework, by
implementing custom InputFormats and OutputFormats?
Or should I use the Hadoop RPC layer? Is there any documentation about it?
Any suggestions?

Thanks,
Aécio Santos.

-- 
Instituto Federal de Ciência, Educação e Tecnologia do Piauí
Laboratório de Pesquisa em Sistemas de Informação
Teresina - Piauí - Brazil