Map/reduce should be a suitable approach for indexing large document collections, but I'm not sure it is suitable for retrieval. You could look at *Nutch* for distributed searching.

Under the hadoop/contrib directory there is an *index* package. It may be helpful :)
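To show why indexing (as opposed to retrieval) maps so naturally onto map/reduce, here is a minimal, self-contained sketch in plain Python: the map phase emits (term, doc_id) pairs, the shuffle groups them by term, and the reduce phase produces each term's postings list. All names here are illustrative stand-ins, not Hadoop or Lucene APIs.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per distinct token, like a Mapper would.
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(term, doc_ids):
    # Collapse the grouped values for one term into a sorted postings list.
    return term, sorted(set(doc_ids))

def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for doc_id, text in docs.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)
    return dict(reduce_phase(t, ds) for t, ds in grouped.items())

docs = {1: "gattaca reads", 2: "gattaca assembly", 3: "reads qc"}
index = build_index(docs)
# index["gattaca"] == [1, 2]
```

The batch shuffle in the middle is exactly what a map/reduce job gives you for free at scale, and also exactly why the same model is a poor fit for retrieval: a query needs a low-latency lookup into the finished postings lists, not another batch pass over the data.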

Matt Wood wrote:
Hello all,

I was wondering if someone in the know could tell me about the current state of play with building and searching large indices with hadoop?

Some background: I work on the human genome project, and we're currently setting up a new facility based around the next generation of DNA sequencing. We're currently producing around 50Tb of data a week, some of which we would like to provide fast access to via an index.

Having read up on hadoop, it appears that it could play a central part in our infrastructure, and that others have tried (and succeeded) in building a distributed indexing and retrieval system with hadoop. I'd be interested if anyone could point me in the right direction to more information or examples of such a system. Yahoo! (with webmap) seems to be close to the sort of thing we would need.

Would map/reduce be a suitable approach for indexing _and_ retrieval, or just indexing? Would Solr/Lucene be a good fit? Any help or pointers to more information would be much appreciated!

If you would like any more details, I'd be more than happy to supply them!

Many thanks,

~ Matt


-------------

Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk



