Miles Osborne wrote:
The problem here is that you don't want each mapper/reducer to have a
copy of the data. You want that data -- which can be very large --
stored in a distributed manner over your cluster, with random
access to it during computation.

(This is what HBase etc. do.)

I had a somewhat similar situation to the one the original poster described. In my case the trick proved to be avoiding actual access to the data whenever possible. The key space was very large but sparsely populated - namely, consisting of phrases up to N words long. The MapFile-s that contained a reference dictionary were large (on the order of a hundred million records), and it was inconvenient to copy them to local machines.

I managed to get decent performance by combining two approaches:

* using a fail-fast version of MapFile-s (see HADOOP-3063) - my map() implementation generated a lot of keys, which had to be tested against the dictionaries, and in most cases the phrases didn't exist in the dictionary, so they never had to be actually retrieved. The result: no I/O at all to check for missing keys. The BloomFilter in a BloomMapFile is loaded completely into memory, so the speed of lookups was fantastic (see the first sketch below).

* and in the cases where I really did have to load a few records from the dictionaries, I kept a local LRU cache. I went with a simple LinkedHashMap, but you could be more sophisticated and use JCS or something like that (see the second sketch below).
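
Roughly, the first approach looks like the sketch below. This is just a minimal illustration, assuming the dictionary was written as a BloomMapFile of Text keys and values - the /data/dict path and the probe key are made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class DictProbe {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // "/data/dict" is a hypothetical path to a dictionary
    // written as a BloomMapFile
    BloomMapFile.Reader dict = new BloomMapFile.Reader(fs, "/data/dict", conf);

    Text key = new Text("some candidate phrase");
    Text value = new Text();

    // get() consults the in-memory BloomFilter first; if the filter
    // says the key is absent, it returns null without touching the
    // on-disk index or data files at all.
    Writable hit = dict.get(key, value);
    if (hit != null) {
      System.out.println("found: " + value);
    }
    dict.close();
  }
}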
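
And the LRU cache is just the textbook LinkedHashMap trick - a minimal sketch, where the capacity you pass in is up to you:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruCache(int maxEntries) {
    // accessOrder = true makes iteration order least-recently-used first
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // called by put(); evicts the eldest (LRU) entry once the cap
    // is exceeded
    return size() > maxEntries;
  }
}

e.g. new LruCache<Text, Text>(10000) - the 10,000-entry cap is an arbitrary number for illustration, not what I actually used.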

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
