Hello all,
Could someone in the know tell me about the current state of play
for building and searching large indices with Hadoop?
Some background: I work on the Human Genome Project, and we're
setting up a new facility based around the next generation
of DNA sequencing. We're currently producing around 50Tb of data a
week, some of which we would like to provide fast access to via an
index.
Having read up on Hadoop, it appears that it could play a central part
in our infrastructure, and that others have succeeded in building
distributed indexing and retrieval systems with it. I'd be grateful
if anyone could point me towards more information or examples of
such a system. Yahoo!'s WebMap seems to be close to the sort of
thing we would need.
Would map/reduce be a suitable approach for indexing _and_ retrieval,
or just indexing? Would Solr/Lucene be a good fit? Any help or
pointers to more information would be much appreciated!
If you would like any more details, I'd be more than happy to supply
them!
Many thanks,
~ Matt
-------------
Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk