We actually use CDBs a good bit outside of M/R, so this is something worth looking into. The big structure we're currently using, though, is a giant tree-based lookup table whose access pattern is essentially random, so I don't think caching hot parts in memory would buy us much there. There is a lesser (but still large) structure this might work for.
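For what it's worth, the CDB idea boils down to leaving the table on disk and letting the OS page cache keep the hot parts resident. Here is a minimal sketch of that principle using a plain memory-mapped file; the fixed-width record layout, class name, and file contents are illustrative assumptions, not our actual format:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical lookup over fixed-width records (8-byte key, 8-byte
// value), sorted by key. The file stays on disk; the OS page cache
// keeps the hot regions in memory, which is the same idea CDB relies on.
public class MappedLookup {
    private static final int RECORD = 16; // 8-byte key + 8-byte value
    private final MappedByteBuffer buf;
    private final long records;

    public MappedLookup(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            // A single mapping caps at ~2 GB; larger files would need
            // multiple mappings, omitted here for brevity.
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            records = ch.size() / RECORD;
        }
    }

    // Binary search over the mapped region; only the touched pages get
    // faulted in, so resident memory tracks the hot set, not file size.
    public Long get(long key) {
        long lo = 0, hi = records - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            long k = buf.getLong((int) (mid * RECORD));
            if (k == key) return buf.getLong((int) (mid * RECORD + 8));
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return null; // key not present
    }
}

Whether that helps depends entirely on how random the accesses really are; for a truly uniform access pattern the page cache ends up churning just like the heap does.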
- Adam

On 2/27/13 10:56 AM, Robert Evans wrote:
> Have you looked at things like CDB (http://cr.yp.to/cdb.html) that would
> allow you to keep most of the file on disk and cache hot parts in memory?
> That really depends on your access pattern.
>
> Alternatively, you could give yourself more heap and take up two slots
> for your map task.
>
> Also, if it is big enough, you might want to look at using a reduce to
> do the join instead of trying to do a map-side join.
>
> --Bobby
>
> On 2/27/13 12:42 PM, "Adam Phelps" <a...@opendns.com> wrote:
>
>> We have a job that uses a large lookup structure that gets created as a
>> static class during the map setup phase (and we have JVM reuse enabled,
>> so this only takes place once). However, of late this structure has
>> grown drastically (due to items beyond our control) and we've seen a
>> substantial increase in map time due to the lower available memory.
>>
>> Are there any easy solutions to this sort of problem? My first thought
>> was to see if it was possible to have all tasks for a job execute in
>> parallel within the same JVM, but I'm not seeing any setting that would
>> allow that. Beyond that, my only idea is to move that data into an
>> external one-per-node key-value store like memcached, but I'm worried
>> that the additional overhead of sending a query for each value being
>> mapped would also kill the job performance.
>>
>> - Adam
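To make Bobby's reduce-side join suggestion concrete, here is a bare-bones sketch: tag each record with its source, key both datasets on the join key, and combine them in the reducer, so nothing has to fit in the mapper's heap. The class names, the tab-separated line format, and the path-based source tagging are all assumptions for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emit every record keyed by the join key and tagged with its source,
// so the reducer sees the lookup row and the event rows together.
class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
            throws IOException, InterruptedException {
        String path = ((FileSplit) ctx.getInputSplit()).getPath().toString();
        String tag = path.contains("lookup") ? "L" : "E"; // source tag
        String[] kv = line.toString().split("\t", 2);
        if (kv.length < 2) return; // skip malformed lines
        ctx.write(new Text(kv[0]), new Text(tag + "\t" + kv[1]));
    }
}

class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
        String lookup = null;
        List<String> events = new ArrayList<String>();
        for (Text v : vals) { // the lookup row can arrive in any position
            String[] tv = v.toString().split("\t", 2);
            if ("L".equals(tv[0])) lookup = tv[1];
            else events.add(tv[1]);
        }
        if (lookup == null) return; // no match for this key
        for (String e : events)
            ctx.write(key, new Text(e + "\t" + lookup));
    }
}

A secondary sort can force the lookup row to arrive first and avoid buffering the events per key, at the cost of some extra plumbing.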
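On the memcached concern: a round trip per mapped value would indeed hurt, but batching keys and resolving them with one multi-get amortizes that overhead. A sketch using spymemcached's bulk get; the host, port, and the batching scheme are illustrative, not tuned values:

import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

// Buffer join keys and resolve them with one multi-get per batch
// instead of one network round trip per record.
public class BatchedLookup {
    private final MemcachedClient client;
    private final List<String> pending = new ArrayList<String>();

    public BatchedLookup(String host, int port) throws Exception {
        client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    public void add(String key) { pending.add(key); }

    // One round trip resolves the whole batch.
    public Map<String, Object> flush() {
        Map<String, Object> results = client.getBulk(pending);
        pending.clear();
        return results;
    }

    public void close() { client.shutdown(); }
}

In a mapper you would buffer keys across map() calls, flush once the batch fills (say, every thousand keys), and emit the joined records from the returned map.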