Miles Osborne wrote:
The problem here is that you don't want each mapper/reducer to have a
copy of the data. You want that data (which can be very large) stored
in a distributed manner over your cluster, with random access to it
during computation.
(This is what HBase etc. do.)
I had a somewhat similar situation to the one the original poster
described. In my case the trick proved to be to avoid actually accessing
the data whenever possible. The key space was very large but sparsely
populated, consisting of phrases up to N words long. The MapFiles that
contained the reference dictionary were large (on the order of a hundred
million records), and it was inconvenient to copy them to local machines.
I managed to get decent performance by combining two approaches (a
rough sketch of both follows the list):

* using a fail-fast version of MapFiles (see HADOOP-3063) - my map()
implementation generated a lot of candidate keys that had to be tested
against the dictionaries, and in most cases the phrases didn't exist in
the dictionary, so they never had to be actually retrieved. Result: no
I/O to check for missing keys. The BloomFilter in a BloomMapFile is
loaded completely into memory, so the speed of lookups was fantastic.

* and when I really did have to load a few records from the
dictionaries, I kept a local LRU cache. I went with a simple
LinkedHashMap, but you could be more sophisticated and use JCS or
something like that.
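
For the record, here's roughly what the combination looks like in code.
This is a minimal sketch, not my actual job code: the CachedDictionary
class, the cache size, and the Text key/value types are made up for
illustration, and it assumes the dictionary was written as a
BloomMapFile.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class CachedDictionary {
  private static final int CACHE_SIZE = 10000; // tune to your memory budget

  private final BloomMapFile.Reader reader;

  // accessOrder=true turns LinkedHashMap into an LRU map; evicting the
  // eldest entry once we exceed CACHE_SIZE keeps it bounded.
  private final Map<Text, Text> cache =
      new LinkedHashMap<Text, Text>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Text, Text> eldest) {
          return size() > CACHE_SIZE;
        }
      };

  public CachedDictionary(FileSystem fs, String dir, Configuration conf)
      throws IOException {
    reader = new BloomMapFile.Reader(fs, dir, conf);
  }

  /** Returns the dictionary entry for a phrase, or null if absent. */
  public Text lookup(Text phrase) throws IOException {
    Text cached = cache.get(phrase);
    if (cached != null) {
      return cached;
    }
    // The in-memory Bloom filter answers "definitely absent" without
    // touching the MapFile data, so misses cost no I/O. Note that
    // BloomMapFile.Reader.get() consults the filter internally anyway;
    // the explicit check here just makes the fail-fast path visible.
    if (!reader.probablyHasKey(phrase)) {
      return null;
    }
    Text value = new Text();
    if (reader.get(phrase, value) == null) {
      return null; // Bloom filter false positive
    }
    // copy the key so later reuse of 'phrase' doesn't corrupt the cache
    cache.put(new Text(phrase), value);
    return value;
  }
}

In a MapReduce job you would open one of these per task (e.g. in
configure()/setup()) and call lookup() from map(); only the small
fraction of keys that pass the Bloom filter and miss the cache ever
reach the MapFile on disk.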
--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com