On Thu, May 13, 2010 at 2:41 PM, Aécio <[email protected]> wrote:
> Hi,
>
> I currently have an indexing and search service that receives documents to
> be indexed and search requests through XML-RPC. The server uses the Lucene
> search engine and stores its index on the local file system. The documents
> to be indexed are usually academic papers, so I'm not trying to index
> large-scale data sets in a single job, although the indexes may become very
> large as more documents are received.
>
> Now we are trying to parallelize the search and indexing by distributing
> the index into shards on a cluster. I've been studying Hadoop, but it's not
> clear yet how to implement the system.
>
> The basic design is:
> 1. We receive a document through XML-RPC and it should be indexed in one
> shard on the cluster.
> 2. We receive a query request through XML-RPC and the query must be executed
> over all shards, and then the hits should be returned in an XML response.
>
> My initial idea is:
> 1. Indexing
> - The received document is used as the input of one map. This function would
> index the document on the local shard using our custom library built on top
> of Lucene. There is no reduce.
>
> 2. Search
> - The received query is used as the input of the map function. This function
> would search the documents on the local shard using our custom library and
> emit the hits. The reduce function would group the hits from all shards.
>
> Is it possible to implement that using the Hadoop MapReduce framework,
> by implementing custom InputFormats and OutputFormats?
> Or should I use the Hadoop RPC layer? Is there any documentation about it?
> Any suggestions?
>
> Thanks,
> Aécio Santos.
>
> --
> Instituto Federal de Ciência, Educação e Tecnologia do Piauí
> Laboratório de Pesquisa em Sistemas de Informação
> Teresina - Piauí - Brazil

You can implement some of what you have described. If you have not already, take a look at these; they are concrete implementations for building and searching distributed indexes.
http://lucene.apache.org/nutch/
http://katta.sourceforge.net/ (not Hadoop based but still pretty cool)
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
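
For what it's worth, the search half of the design quoted above (run the query on every shard, then group the hits) can be sketched as a plain scatter-gather, independent of Hadoop. This is only an illustration under loud assumptions: `SHARDS`, `search_shard`, and `search_all` are hypothetical names, each shard is an in-memory dict rather than a Lucene index, term counting stands in for Lucene scoring, and threads stand in for cluster nodes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for index shards: each maps doc id -> document text.
SHARDS = [
    {"a1": "hadoop lucene index", "a2": "distributed search"},
    {"b1": "lucene search shard", "b2": "xml rpc server"},
]

def search_shard(shard, query):
    """'Map' step: score the docs of one shard by counting matching query terms."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits

def search_all(shards, query, top_k=10):
    """'Reduce' step: fan the query out to every shard, then merge hits by score."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, query), shards))
    all_hits = [hit for hits in per_shard for hit in hits]
    # Highest score first; ties are broken by doc id to keep the order stable.
    return heapq.nsmallest(top_k, all_hits, key=lambda h: (-h[0], h[1]))

if __name__ == "__main__":
    for score, doc_id in search_all(SHARDS, "lucene search"):
        print(doc_id, score)
```

In a real deployment each `search_shard` call would be a remote call to the node holding that shard, and only the top hits per shard would need to travel back to the merging node.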
