On Thu, May 13, 2010 at 2:41 PM, Aécio <[email protected]> wrote:

> Hi,
>
> Currently I have an indexing and search service that receives documents
> to be indexed and search requests through XML-RPC. The server uses the
> Lucene search engine and stores its index on the local file system. The
> documents to be indexed are usually academic papers, so I'm not trying to
> index large-scale data sets in a single job, although the indexes may
> become very large as more documents are received.
>
> Now we are trying to parallelize search and indexing by distributing the
> index into shards on a cluster. I've been studying Hadoop, but it's not
> yet clear to me how to implement the system.
>
> The basic design is:
> 1. We receive a document through XML-RPC and it should be indexed in one
> shard on the cluster.
> 2. We receive a query request through XML-RPC and the query must be executed
> over all shards and then the hits should be returned in an XML response.
>
>
> My initial idea is:
> 1. Indexing
> - The document received is used as the input of one map. This function would
> index the document on the local shard using our custom library built on top
> of Lucene. There is no reduce.
>
> 2. Search
> - The query received is used as the input of the map function. This function
> would run the query against the local shard using our custom library and
> emit the hits. The reduce function would group the hits from all shards.
>
>
> Is it possible to implement that using the Hadoop MapReduce framework,
> by implementing custom InputFormats and OutputFormats?
> Or should I use the Hadoop RPC layer? Is there any documentation about it?
> Any suggestions?
>
> Thanks,
> Aécio Santos.
>
> --
> Instituto Federal de Ciência, Educação e Tecnologia do Piauí
> Laboratório de Pesquisa em Sistemas de Informação
> Teresina - Piauí - Brazil
>

You can implement some of what you have described.

Take a look at these if you have not already. They are concrete
implementations for building and searching distributed indexes.

http://lucene.apache.org/nutch/
http://katta.sourceforge.net/

(Not Hadoop-based, but still pretty cool.)
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
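
For reference, the two request paths in your design (send each document to
one shard; fan a query out to every shard and merge the hits) can be sketched
without any framework. This is only a toy illustration in Python, not Hadoop
or Lucene code: the Shard and ShardedIndex classes, the CRC32 routing, and the
term-count scoring are all assumptions made for the example, and a real shard
would wrap a Lucene index on a remote node.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Hypothetical in-memory stand-in for one Lucene index shard."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc_id -> text

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, query):
        # Toy scoring (assumption): count occurrences of the query term.
        term = query.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append((score, doc_id, self.name))
        return hits


class ShardedIndex:
    """Hypothetical front end that routes writes and fans out reads."""

    def __init__(self, num_shards):
        self.shards = [Shard(f"shard-{i}") for i in range(num_shards)]

    def shard_for(self, doc_id):
        # Stable hash, so the same document id always maps to the same shard.
        key = zlib.crc32(doc_id.encode("utf-8"))
        return self.shards[key % len(self.shards)]

    def index(self, doc_id, text):
        # Indexing path: exactly one shard receives the document.
        self.shard_for(doc_id).index(doc_id, text)

    def search(self, query, top_n=10):
        # Scatter: run the query on every shard in parallel.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            per_shard = list(pool.map(lambda s: s.search(query), self.shards))
        # Gather: merge the per-shard hits and keep the best-scoring ones.
        all_hits = [hit for hits in per_shard for hit in hits]
        return heapq.nlargest(top_n, all_hits)


if __name__ == "__main__":
    index = ShardedIndex(num_shards=3)
    index.index("paper-1", "distributed search with lucene")
    index.index("paper-2", "lucene lucene indexing")
    index.index("paper-3", "hadoop mapreduce")
    for score, doc_id, shard in index.search("lucene"):
        print(score, doc_id, shard)
```

Hashing the document id keeps routing stateless, though with plain modulo
hashing a change in the number of shards would remap most documents.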
