Can you show (cut & paste) what your job config looks like?
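
In particular, the JVM-reuse setting is the first thing I'd check. For
reference, a minimal sketch of how it is usually set on the JobConf (old
mapred API; MyJob is just a placeholder for your job class):

    // Sketch: ask the TaskTracker to reuse the child JVM for tasks of this job.
    // -1 means "no limit on tasks per JVM"; the default is 1 (fresh JVM per task).
    JobConf conf = new JobConf(MyJob.class);
    conf.setNumTasksToExecutePerJvm(-1);
    // equivalently: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

Note that reuse only applies to tasks of the same job, which is why separate
jobs still get separate JVMs.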

On Thu, Apr 29, 2010 at 8:58 AM, Danny Leshem <dles...@gmail.com> wrote:

> Hello,
>
> I'm using Hadoop to run a memory-intensive job on different input data.
> The job requires the availability (in memory) of some read-only HashMap,
> about 4Gb in size.
> The same fixed HashMap is used for all input data.
>
> I'm using a cluster of EC2 machines with more than enough memory (around
> 7Gb each) to hold a single instance of the HashMap in full.
> The problem is that each MapReduce task runs in its own process, so the
> HashMap gets replicated once per concurrent task on each machine - not good!
>
> According to the following link, you can force Hadoop to run multiple tasks (of
> the same job) in the same JVM:
>
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
> This doesn't seem to work for me - I still see several Java processes
> spawned.
>
> But even if it did work, running several jobs in parallel (say, on
> different data) would still require the HashMap to be replicated!
> Can one force Hadoop to run all jobs in the same JVM? (as opposed to just
> all tasks of a given job)
>
> If not, what's the recommended paradigm for running multiple instances of a
> job that requires large read-only structures in memory?
>
> Thanks!
> Danny
>
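
On the read-only structure question: assuming JVM reuse does kick in, one
common pattern (a rough sketch only - LookupMapper and buildLookupMap are
placeholders, not your code) is to load the map lazily into a static field in
the mapper's configure(), so each child JVM pays the 4Gb cost only once no
matter how many tasks it runs:

    import java.io.IOException;
    import java.util.HashMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Shared by every task that runs inside this child JVM.
      private static HashMap<String, String> lookup;

      @Override
      public void configure(JobConf job) {
        synchronized (LookupMapper.class) {
          if (lookup == null) {
            lookup = buildLookupMap(job);   // expensive 4Gb load, done once per JVM
          }
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // ... look things up in 'lookup' and emit results ...
      }

      private static HashMap<String, String> buildLookupMap(JobConf job) {
        // placeholder: load the map from DistributedCache / HDFS / S3 / local disk
        return new HashMap<String, String>();
      }
    }

That still doesn't share the map across different jobs, though, since Hadoop
won't run different jobs in the same child JVM.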



-- 

Raja Thiruvathuru
