The problem here is that you don't want each mapper/reducer to hold its own copy of the data. You want that data, which can be very large, stored in a distributed manner over your cluster, with random access to it during computation. (This is what HBase etc. do.)

Miles

2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> On Thu, Sep 18, 2008 at 1:05 AM, Chris Dyer <[EMAIL PROTECTED]> wrote:
>> Basically, I'd like to be able to load the entire contents of a
>> key-value map file in DFS into memory across many machines in my
>> cluster so that I can access any of it with ultra-low latencies.
>
> I think the simplest way, which I've used, is to put your key-value
> file into DistributedCache, then load it into a HashMap or ArrayList
> in the configure method of each Map/Reduce task.
>
> -Stuart
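
For reference, a minimal sketch of the configure-time load Stuart describes, using the old org.apache.hadoop.mapred API that was current at the time. The class name LookupMapper, the tab-separated line format, and the join-style map() body are assumptions made for illustration, not anything from the thread; the file itself would be added in the driver with DistributedCache.addCacheFile(...).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // In-memory copy of the cached key-value file, built once per task JVM.
  private final Map<String, String> lookup = new HashMap<String, String>();

  public void configure(JobConf job) {
    try {
      // Files added in the driver via DistributedCache.addCacheFile(...)
      // appear here as local paths on each task node.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader in =
            new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          // Assumed format: one tab-separated "key<TAB>value" pair per line.
          String[] kv = line.split("\t", 2);
          if (kv.length == 2) {
            lookup.put(kv[0], kv[1]);
          }
        }
        in.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to load cached lookup file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Illustrative use: join each input line against the in-memory map.
    String hit = lookup.get(value.toString());
    if (hit != null) {
      output.collect(value, new Text(hit));
    }
  }
}

This keeps a full copy of the file in every task's JVM, so it only works while the data fits comfortably in a task's heap; past that point the distributed-store approach Miles suggests is the better fit.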

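And a rough sketch of that random-access alternative: a point read against HBase from client code. Note this uses the later HTable/Get client API rather than the one current in 2008, and the table name "kv_store" and the "d"/"v" column family and qualifier are placeholders, not anything from the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws IOException {
    // Reads cluster connection settings from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "kv_store");
    try {
      // Random read of a single row by key; the data stays in the cluster
      // rather than being copied into every task's memory.
      Get get = new Get(Bytes.toBytes("some-key"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"));
      if (value != null) {
        System.out.println(Bytes.toString(value));
      }
    } finally {
      table.close();
    }
  }
}

The trade-off is latency: each lookup is a network round trip to a region server rather than a local HashMap hit, but the data can be arbitrarily large and is shared by every task in the cluster.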