Matt,

You probably want to mail core-user, not core-dev...

Here is what I wrote on [EMAIL PROTECTED] yesterday (in answer to Samuel Gao's question there):

There are actually several distributed indexing or searching projects in Lucene (the top-level ASF Lucene project, not Lucene Java), and it's time to start thinking about the possibility of bringing them together, finding commonalities, etc. Here is the summary:

- Lucene - distributed search via ParallelMultiSearcher. How you split indices/shards is up to you.
- Solr - distributed search via SOLR-303 (see DistributedSearch on its Wiki). How you split indices/shards is up to you.
- Nutch - distributed search via its org.apache.nutch.ipc (I think). How you split indices/segments is up to you.
- Nutch - see the bottom of http://wiki.apache.org/nutch/Nutch2Architecture for a new push to come up with shard management tools

There is also Hadoop:

- Using MapReduce + HDFS to build a single Lucene index in a distributed fashion (see contrib/index in Hadoop).

There is also GridLucene project somewhere on the web...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Matt Wood <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Sent: Monday, April 28, 2008 4:50:00 PM
> Subject: Distributed indexing
>
> Hello all,
>
> I was wondering if someone in the know could tell me about the current
> state of play with building and searching large indices with hadoop?
>
> Some background: I work on the human genome project, and we're
> currently setting up a new facility based around the next generation
> of DNA sequencing. We're currently producing around 50Tb of data a
> week, some of which we would like to provide fast access to via an
> index.
>
> Having read up on hadoop, it appears that it could play a central part
> in our infrastructure, and that others have tried (and succeeded) in
> building a distributed indexing and retrieval system with hadoop. I'd
> be interested if anyone could point me in the right direction to more
> information or examples of such a system. Yahoo! (with webmap) seems
> to be close to the sort of thing we would need.
>
> Would map/reduce be a suitable approach for indexing _and_ retrieval,
> or just indexing? Would Solr/Lucene be a good fit? Any help or
> pointers to more information would be much appreciated!
>
> If you would like any more details, I'd be more than happy to supply
> them!
>
> Many thanks,
>
> ~ Matt
>
> -------------
>
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
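
To make the "how you split indices/shards is up to you" point in the reply above a bit more concrete, here is a rough, self-contained Python sketch of the general pattern: route each document to a shard, build a small index per shard, then query all shards in parallel and merge the per-shard top hits. It does not use Lucene, Solr, Nutch or Hadoop. Every name in it (shard_for, build_shards, search_shard, search_all), the CRC32 routing rule and the raw term-frequency score are assumptions made only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import heapq
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard with a stable hash (one possible split policy)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(docs: dict, num_shards: int) -> list:
    """Build one tiny inverted index (term -> {doc_id: term frequency}) per shard."""
    shards = [defaultdict(dict) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id, num_shards)]
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return shards


def search_shard(index, term: str, k: int) -> list:
    """Return up to k (score, doc_id) hits from a single shard, best first."""
    postings = index.get(term.lower(), {})
    return heapq.nlargest(k, ((tf, doc_id) for doc_id, tf in postings.items()))


def search_all(shards, term: str, k: int = 10) -> list:
    """Query every shard in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: search_shard(s, term, k), shards))
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [(doc_id, score) for score, doc_id in merged]


if __name__ == "__main__":
    # Hypothetical toy documents, just to exercise the functions above.
    docs = {
        "d1": "genome sequence read",
        "d2": "genome genome assembly",
        "d3": "sequence alignment",
    }
    print(search_all(build_shards(docs, 2), "genome", k=2))
```

Merging per-shard top-k lists is only safe here because the toy score (term frequency) does not depend on which shard a document lives in. A score that uses collection-wide statistics would need those statistics shared across shards, or the merged ranking can differ from a single-index ranking.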
