Hello all,

I was wondering if someone in the know could tell me about the current state of play with building and searching large indices with Hadoop?

Some background: I work on the human genome project, and we're currently setting up a new facility based around the next generation of DNA sequencing. We're producing around 50 TB of data a week, some of which we would like to provide fast access to via an index.

From what I've read, Hadoop could play a central part in our infrastructure, and others have tried (and succeeded) in building a distributed indexing and retrieval system with it. I'd be grateful if anyone could point me towards more information or examples of such a system. Yahoo! (with WebMap) seems to be close to the sort of thing we would need.

Would map/reduce be a suitable approach for indexing _and_ retrieval, or just indexing? Would Solr/Lucene be a good fit? Any help or pointers to more information would be much appreciated!
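For concreteness, here's a minimal sketch of the kind of map/reduce indexing pass we have in mind: building a simple inverted index over records. Everything here is hypothetical on our side (the tab-separated "recordId<TAB>text" input layout, the class names), and a real deployment would presumably write Lucene indexes rather than plain-text posting lists, but it shows the shape of the job:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    // Mapper: for each input record, emit a (term, recordId) pair per term.
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "recordId<TAB>text to index"
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            docId.set(parts[0]);
            for (String t : parts[1].toLowerCase().split("\\s+")) {
                if (t.isEmpty()) continue;
                term.set(t);
                context.write(term, docId);
            }
        }
    }

    // Reducer: gather all record ids for a term into one posting list.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        private final Text postings = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                if (sb.length() > 0) sb.append(',');
                sb.append(v.toString());
            }
            postings.set(sb.toString());
            context.write(key, postings);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(IndexMapper.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Building the index this way is clearly a batch job, which is why I suspect map/reduce fits indexing but not the low-latency retrieval side; I'd love to hear how others serve queries over indexes built like this.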

If you would like any more details, I'd be more than happy to supply them!

Many thanks,

~ Matt


-------------

Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk



