Re: Distributed indexing

Otis Gospodnetic Mon, 28 Apr 2008 08:22:11 -0700

Matt,

You probably want to mail core-user, not core-dev....


Here is what I wrote on [EMAIL PROTECTED] yesterday (in answer to Samuel Gao's 
question there):

There are actually several distributed indexing or searching projects in Lucene 
(the top-level ASF Lucene project, not Lucene Java), and it's time to start 
thinking about the possibility of bringing them together, finding commonalities, 
etc.

Here is the summary:
- Lucene - distributed search via ParallelMultiSearcher.  How you split 
indices/shards is up to you.
- Solr - distributed search via SOLR-303 (see DistributedSearch on its Wiki).  
How you split indices/shards is up to you.
- Nutch - distributed search via its org.apache.nutch.ipc (I think).  How you 
split indices/segments is up to you.
- Nutch - see the bottom of http://wiki.apache.org/nutch/Nutch2Architecture for 
a new push to come up with shard management tools
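
The search-side options above share one pattern: send the query to every shard, collect each shard's top hits, and merge them. Below is a minimal sketch of that scatter-gather step in plain Python. It does not use the Lucene, Solr, or Nutch APIs; the in-memory SHARDS, the term-count scoring, and the function names are all assumptions made up for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each maps a doc id to its text.
SHARDS = [
    {"d1": "human genome sequencing", "d2": "lucene index"},
    {"d3": "genome index on hadoop", "d4": "solr distributed search"},
]

def search_shard(shard, term, k):
    """Score docs by how often the term occurs; return this shard's top k as (score, doc_id)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def distributed_search(shards, term, k=10):
    """Query all shards in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, term, k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)
```

Calling distributed_search(SHARDS, "genome", k=2) returns one hit from each shard. Each shard only has to return k hits, so the merge stays cheap no matter how the shards were split.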

There is also Hadoop:
- Using MapReduce + HDFS to build a single Lucene index in a distributed 
fashion (see contrib/index in Hadoop).
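
To give a rough idea of how index building maps onto MapReduce, here is a toy sketch in plain Python. It is not the contrib/index code and does not build a Lucene index; it only builds a simple term-to-documents inverted index, and the function names and the in-process shuffle are assumptions for illustration.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_term(term, doc_ids):
    """Reduce step: turn the grouped doc ids into a sorted posting list."""
    return term, sorted(doc_ids)

def build_index(docs):
    """Run map, shuffle, and reduce in-process over a dict of doc_id -> text."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text))
    return dict(reduce_term(term, ids) for term, ids in shuffle(pairs).items())
```

On a real cluster the map calls would run on the nodes holding the input blocks and each reducer would write its part of the index, but the division of work is the same.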

There is also a GridLucene project somewhere on the web...
 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Matt Wood <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Sent: Monday, April 28, 2008 4:50:00 PM
> Subject: Distributed indexing
> 
> Hello all,
> 
> I was wondering if someone in the know could tell me about the current  
> state of play with building and searching large indices with hadoop?
> 
> Some background: I work on the human genome project, and we're  
> currently setting up a new facility based around the next generation  
> of DNA sequencing. We're currently producing around 50Tb of data a  
> week, some of which we would like to provide fast access to via an  
> index.
> 
> Having read up on hadoop, it appears that it could play a central part  
> in our infrastructure, and that others have tried (and succeeded) in  
> building a distributed indexing and retrieval system with hadoop. I'd  
> be interested if anyone could point me in the right direction to more  
> information or examples of such a system. Yahoo! (with webmap) seems  
> to be close to the sort of thing we would need.
> 
> Would map/reduce be a suitable approach for indexing _and_ retrieval,  
> or just indexing? Would Solr/Lucene be a good fit? Any help or  
> pointers to more information would be  much appreciated!
> 
> If you would like any more details, I'd be more than happy to supply  
> them!
> 
> Many thanks,
> 
> ~ Matt
> 
> 
> -------------
> 
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
> 
> 
> 
> -- 
>  The Wellcome Trust Sanger Institute is operated by Genome Research 
>  Limited, a charity registered in England with number 1021457 and a 
>  company registered in England with number 2742969, whose registered 
>  office is 215 Euston Road, London, NW1 2BE. 
>
