On 04/29/2010 08:58 AM, Danny Leshem wrote:
Hello,

I'm using Hadoop to run a memory-intensive job on different input data.
The job requires the availability (in memory) of a read-only HashMap,
about 4GB in size.
The same fixed HashMap is used for all inputs.

I'm using a cluster of EC2 machines with more than enough memory (around 7GB
each) to hold a single instance of the HashMap in full.
The problem is that each MapReduce task runs in its own process, so the
HashMap is replicated once per task running on each machine - not good!

According to the following link, you can force Hadoop to run multiple tasks
(of the same job) in the same JVM:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
This doesn't seem to work for me - I still see several Java processes
spawned.
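
For reference, I'm enabling reuse roughly like this (old
org.apache.hadoop.mapred API; MyJob stands in for my actual driver class):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);
    // Equivalent to setting mapred.job.reuse.jvm.num.tasks=-1:
    // reuse the child JVM for an unlimited number of this job's tasks.
    conf.setNumTasksToExecutePerJvm(-1);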

But even if it did work, running several jobs in parallel (say, on different
inputs) would still require the HashMap to be replicated!
Can one force Hadoop to run all jobs in the same JVM? (as opposed to just
all tasks of a given job)

If not, what's the recommended paradigm for running multiple instances of a
job that requires large read-only structures in memory?

Thanks!
Danny

You could serialize the map and ship it via the DistributedCache. Then it
would be available to all map & reduce tasks. (Though they'd have to read it
back in and deserialize it first, of course.)
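
Roughly like this, with the old mapred API - the HDFS path, class names, and
map types below are made up for illustration:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import java.net.URI;
    import java.util.HashMap;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // Driver side: register the serialized map (already written to HDFS at
    // /shared/lookup.ser) so it's copied to each node's local disk once per job.
    JobConf conf = new JobConf(MyJob.class);
    DistributedCache.addCacheFile(new URI("/shared/lookup.ser"), conf);

    // Mapper side: deserialize the local copy once per task, in configure().
    public class LookupMapper extends MapReduceBase /* implements Mapper<...> */ {
      private HashMap<String, String> lookup;

      @Override
      public void configure(JobConf job) {
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(
              new FileInputStream(cached[0].toString())));
          lookup = (HashMap<String, String>) in.readObject();
          in.close();
        } catch (Exception e) {
          throw new RuntimeException("failed to load lookup map", e);
        }
      }
    }

One caveat: the cache saves the per-task HDFS reads (files are copied to each
node once per job), but each task process still deserializes its own in-memory
copy, so it won't reduce the per-machine memory footprint on its own.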

HTH,

DR
