Miles Osborne wrote:
The problem here is that you don't want each mapper/reducer to have a
copy of the data. You want that data (which can be very large) stored
in a distributed manner over your cluster, with random access to it
during computation.
(This is what HBase etc. do.)
I had a somewhat similar situation to the one the original poster
described. In my case the trick proved to be to avoid actually accessing
the data whenever possible. The key space was very large but sparsely
populated, consisting of phrases up to N words long. The MapFiles that
contained the reference dictionary were large (on the order of a hundred
million records), and it was inconvenient to copy them to local machines.
I managed to get decent performance by combining two approaches (a
rough sketch of both follows the list):

* using a fail-fast version of MapFiles (see HADOOP-3063) - my map()
implementation generated a lot of candidate keys that had to be tested
against the dictionaries, and in most cases the phrases didn't exist in
the dictionary, so they never had to be actually retrieved. Result: no
I/O to check for missing keys. The BloomFilter in a BloomMapFile is
loaded completely into memory, so the speed of lookups was fantastic.

* and when I really did have to load a few records from the
dictionaries, I kept a local LRU cache. I went with a simple
LinkedHashMap, but you could be more sophisticated and use JCS or
something like that.
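
For the record, here's roughly what the combination looks like in code.
This is a minimal sketch, not my actual job code: the CachedDictionary
class, the cache size, and the Text key/value types are made up for
illustration, and it assumes the dictionary was written as a
BloomMapFile.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class CachedDictionary {
  private static final int CACHE_SIZE = 10000; // tune to your memory budget

  private final BloomMapFile.Reader reader;

  // accessOrder=true turns LinkedHashMap into an LRU map; evicting the
  // eldest entry once we exceed CACHE_SIZE keeps it bounded.
  private final Map<Text, Text> cache =
      new LinkedHashMap<Text, Text>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Text, Text> eldest) {
          return size() > CACHE_SIZE;
        }
      };

  public CachedDictionary(FileSystem fs, String dir, Configuration conf)
      throws IOException {
    reader = new BloomMapFile.Reader(fs, dir, conf);
  }

  /** Returns the dictionary entry for a phrase, or null if absent. */
  public Text lookup(Text phrase) throws IOException {
    Text cached = cache.get(phrase);
    if (cached != null) {
      return cached;
    }
    // The in-memory Bloom filter answers "definitely absent" without
    // touching the MapFile data, so misses cost no I/O. Note that
    // BloomMapFile.Reader.get() consults the filter internally anyway;
    // the explicit check here just makes the fail-fast path visible.
    if (!reader.probablyHasKey(phrase)) {
      return null;
    }
    Text value = new Text();
    if (reader.get(phrase, value) == null) {
      return null; // Bloom filter false positive
    }
    // copy the key so later reuse of 'phrase' doesn't corrupt the cache
    cache.put(new Text(phrase), value);
    return value;
  }
}

In a MapReduce job you would open one of these per task (e.g. in
configure()/setup()) and call lookup() from map(); only the small
fraction of keys that pass the Bloom filter and miss the cache ever
reach the MapFile on disk.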
--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com