Hi,

I have an indexing and search service that receives documents to be indexed and search requests through XML-RPC. The server uses the Lucene search engine and stores its index on the local file system. The documents to be indexed are usually academic papers, so I'm not trying to index large-scale data sets in a single job, although the indexes may become very large as more documents are received.

Now we are trying to parallelize search and indexing by distributing the index into shards on a cluster. I've been studying Hadoop, but it's not clear yet how to implement the system.

The basic design is:

1. We receive a document through XML-RPC, and it should be indexed in one shard on the cluster.
2. We receive a query request through XML-RPC, and the query must be executed over all shards; the hits should then be returned in an XML response.

My initial idea is:

1. Indexing - The received document is used as the input of one map function. This function would index the document on the local shard using our custom library built on top of Lucene. There is no reduce.
2. Search - The received query is used as the input of the map function. This function would search the local shard using our custom library and emit the hits. The reduce function would group the hits from all shards.

Is it possible to implement this using the Hadoop MapReduce framework, by implementing custom InputFormats and OutputFormats? Or should I use the Hadoop RPC layer? Is there any documentation about it? Any suggestions?

Thanks,
Aécio Santos.

--
Instituto Federal de Ciência, Educação e Tecnologia do Piauí
Laboratório de Pesquisa em Sistemas de Informação
Teresina - Piauí - Brazil
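
P.S. To make the idea concrete, here is a minimal single-process sketch in Java of the two steps described above: routing each document to one shard, and running a query on all shards before merging the hits. It is only an illustration under assumptions: Shard, InMemoryShard, Hit, shardFor and searchAll are hypothetical names, not our library's API or Hadoop's; threads stand in for cluster nodes; and the scoring is a toy term count, not Lucene's ranking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShardedSearchSketch {

    /** One search hit: a document id and a relevance score. */
    static final class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Hypothetical stand-in for the Lucene-based library running on one shard. */
    interface Shard {
        void index(String docId, String text);
        List<Hit> search(String query, int topN) throws Exception;
    }

    /** Toy shard: keeps documents in memory and scores by counting exact term matches. */
    static final class InMemoryShard implements Shard {
        private final Map<String, String> docs = new HashMap<>();

        public synchronized void index(String docId, String text) {
            docs.put(docId, text);
        }

        public synchronized List<Hit> search(String query, int topN) {
            List<Hit> hits = new ArrayList<>();
            for (Map.Entry<String, String> e : docs.entrySet()) {
                int count = 0;
                for (String term : e.getValue().split("\\s+")) {
                    if (term.equalsIgnoreCase(query)) count++;
                }
                if (count > 0) hits.add(new Hit(e.getKey(), count));
            }
            return best(hits, topN);
        }
    }

    /** Sorts hits by descending score and keeps at most topN of them. */
    static List<Hit> best(List<Hit> hits, int topN) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
    }

    /** Indexing step: each document goes to exactly one shard, chosen by hashing its id. */
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Search step: query every shard in parallel ("map"), then merge the hits ("reduce"). */
    static List<Hit> searchAll(List<Shard> shards, String query, int topN) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard shard : shards) {
                tasks.add(() -> shard.search(query, topN));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> result : pool.invokeAll(tasks)) {
                merged.addAll(result.get());
            }
            return best(merged, topN);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch each shard returns only its own top N hits, so the merge step handles at most N hits per shard regardless of index size. On a real cluster the thread pool would be replaced by remote calls to the nodes holding the shards.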
