We actually use CDBs a fair bit outside of M/R.  This is worth looking
into, but the big structure we're currently using is a giant tree-based
lookup table whose access pattern is essentially random, so I don't
think caching would be of much use.  There is a smaller (but still
large) structure this might work for.

- Adam

On 2/27/13 10:56 AM, Robert Evans wrote:
> Have you looked at things like CDB (http://cr.yp.to/cdb.html)?  That
> would allow you to keep most of the file on disk and cache the hot parts
> in memory.  Whether it helps really depends on your access pattern.
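
For what it's worth, that caching layer can be a small amount of code.  Below
is a sketch assuming a hypothetical Cdb reader type with a find() method (the
real Java CDB ports differ in their APIs); the LRU part is just an
access-ordered LinkedHashMap:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Placeholder for a CDB binding; the real Java ports expose
    // something similar to this.
    interface Cdb {
        byte[] find(byte[] key);
    }

    // LRU layer over an on-disk CDB.  Only the hot keys stay in memory;
    // everything else is read from disk on a cache miss.
    public class CachedCdb {
        private final Cdb cdb;
        private final Map<String, byte[]> cache;

        public CachedCdb(Cdb cdb, final int maxEntries) {
            this.cdb = cdb;
            // Access-ordered LinkedHashMap that evicts the least recently
            // used entry once it grows past maxEntries.
            this.cache = new LinkedHashMap<String, byte[]>(maxEntries, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        public byte[] get(String key) {
            byte[] value = cache.get(key);
            if (value == null) {
                value = cdb.find(key.getBytes());  // disk read on a miss
                if (value != null) {
                    cache.put(key, value);
                }
            }
            return value;
        }
    }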
> 
> Alternatively, you could give yourself more heap and have each map task
> take up two slots.
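
A sketch of that heap/slots route on a Hadoop 1.x cluster running the
CapacityScheduler; the numbers are placeholders and have to line up with the
cluster's mapred.cluster.map.memory.mb setting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BigHeapJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // A bigger child heap for the task JVMs.
            conf.set("mapred.child.java.opts", "-Xmx3072m");
            // With the CapacityScheduler, requesting twice the per-slot
            // memory (mapred.cluster.map.memory.mb, assumed to be 2048
            // here) charges the task two map slots.
            conf.setInt("mapred.job.map.memory.mb", 4096);
            Job job = new Job(conf, "map join with a large lookup structure");
            // ... set mapper class, input/output paths, etc., then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }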
> 
> Also, if the structure is big enough, you might want to look at using a
> reduce to do the join instead of trying to do a map-side join.
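
For comparison, a bare-bones reduce-side join: tag every record with its
source, group on the join key, and pair rows in the reducer.  The
tab-separated layout and the "L"/"D" tags here are assumptions made for the
sketch:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoin {

        // Tags each data record "D"; a symmetric mapper for the lookup
        // input would tag its records "L" (wired up via MultipleInputs).
        public static class DataMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final Text outKey = new Text();
            private final Text outVal = new Text();

            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t", 2);
                if (parts.length < 2) return;   // skip malformed lines
                outKey.set(parts[0]);           // join key in column one
                outVal.set("D\t" + parts[1]);
                ctx.write(outKey, outVal);
            }
        }

        // For each join key, pairs every data row with the lookup row.
        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                String lookupRow = null;
                List<String> dataRows = new ArrayList<String>();
                for (Text v : values) {
                    String s = v.toString();
                    if (s.startsWith("L\t")) lookupRow = s.substring(2);
                    else dataRows.add(s.substring(2));
                }
                if (lookupRow == null) return;  // key absent from the lookup table
                for (String row : dataRows) {
                    ctx.write(key, new Text(row + "\t" + lookupRow));
                }
            }
        }
    }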
> 
> --Bobby
> 
> On 2/27/13 12:42 PM, "Adam Phelps" <a...@opendns.com> wrote:
> 
>> We have a job that uses a large lookup structure that gets built as a
>> static object during the map setup phase (and we have JVM reuse enabled,
>> so this only happens once per JVM).  However, of late this structure has
>> grown drastically (due to factors beyond our control), and we've seen a
>> substantial increase in map time because of the reduced available memory.
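
The pattern described here looks roughly like the sketch below; buildLookup()
stands in for the real loading code, and the guarded static field is what
makes JVM reuse pay off:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Static, so a reused JVM builds the structure only once even
        // though setup() runs for every task.
        private static volatile Map<String, String> lookup;

        protected void setup(Context ctx) throws IOException {
            if (lookup == null) {
                synchronized (LookupMapper.class) {
                    if (lookup == null) {
                        lookup = buildLookup(ctx.getConfiguration());
                    }
                }
            }
        }

        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String value = lookup.get(line.toString());
            if (value != null) {
                ctx.write(line, new Text(value));
            }
        }

        private static Map<String, String> buildLookup(Configuration conf)
                throws IOException {
            // Stand-in for the real loading code, e.g. parsing a file
            // shipped via the distributed cache into an in-memory map.
            return new HashMap<String, String>();
        }
    }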
>>
>> Are there any easy solutions to this sort of problem?  My first thought
>> was to see if it was possible to have all tasks for a job execute in
>> parallel within the same JVM, but I'm not seeing any setting that would
>> allow that.  Beyond that, my only idea is to move the data into an
>> external, one-per-node key-value store like memcached, but I'm worried
>> that the overhead of sending a query for each value being mapped would
>> also kill the job's performance.
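
The memcached route would look something like the sketch below (spymemcached
is used purely as an example client).  The per-record client.get() round trip
is exactly the overhead in question; spymemcached's getBulk() can batch keys
to amortize it somewhat, but the cost doesn't disappear:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MemcachedLookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private MemcachedClient client;

        protected void setup(Context ctx) throws IOException {
            // One memcached instance per node, reached over loopback.
            client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        }

        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            Object value = client.get(line.toString());  // one round trip per record
            if (value != null) {
                ctx.write(line, new Text(value.toString()));
            }
        }

        protected void cleanup(Context ctx) {
            client.shutdown();
        }
    }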
>>
>> - Adam
> 
