Can you show (cut & paste) what your job config looks like?
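In the meantime, here is a rough sketch of the pattern I would expect: enable task JVM reuse via JobConf.setNumTasksToExecutePerJvm(-1) (the mapred.job.reuse.jvm.num.tasks property from the tutorial link you mention), and keep the HashMap in a static field that is populated only once per JVM, so later tasks in the same child JVM find it already loaded. The class name, key/value types, and the loadLookup() helper below are placeholders, not taken from your job:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Built once per JVM. With JVM reuse enabled, later tasks running in the
  // same child JVM see the already-populated map instead of rebuilding it.
  private static Map<String, String> sharedLookup;

  @Override
  public void configure(JobConf job) {
    synchronized (SharedLookupMapper.class) {
      if (sharedLookup == null) {
        sharedLookup = loadLookup(job);
      }
    }
  }

  // Placeholder: populate the large read-only map here, e.g. from files
  // shipped to the node via the DistributedCache.
  private static Map<String, String> loadLookup(JobConf job) {
    return new HashMap<String, String>();
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String hit = sharedLookup.get(value.toString());
    if (hit != null) {
      output.collect(value, new Text(hit));
    }
  }

  // Driver-side configuration: -1 means "reuse the child JVM for an
  // unlimited number of tasks of this job", the same effect as setting
  // mapred.job.reuse.jvm.num.tasks=-1 in the job config.
  public static JobConf createJobConf() {
    JobConf conf = new JobConf(SharedLookupMapper.class);
    conf.setNumTasksToExecutePerJvm(-1);
    conf.setMapperClass(SharedLookupMapper.class);
    return conf;
  }
}

Note this only shares the map between tasks of the same job that land in the same child JVM; as you point out, separate jobs still run in separate JVMs, so each concurrently running job would still hold its own copy.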
On Thu, Apr 29, 2010 at 8:58 AM, Danny Leshem <dles...@gmail.com> wrote:
> Hello,
>
> I'm using Hadoop to run a memory intensive job on different input data.
> The job requires the availability (in memory) of some read-only HashMap,
> about 4Gb in size. The same fixed HashMap is used for all input data.
>
> I'm using a cluster of EC2 machines with more than enough memory (around
> 7Gb each) to hold a single instance of the HashMap in full.
> The problem is that each MapReduce task runs in its own process, so the
> HashMap is replicated times the number of per-machine tasks - not good!
>
> According to the following link, you can force Hadoop to run multiple tasks
> (of the same job) in the same JVM:
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
> This doesn't seem to work for me - I still see several Java processes
> spawned.
>
> But even if it did work, running several jobs in parallel (say, on
> different data) would still require the HashMap to be replicated!
> Can one force Hadoop to run all jobs in the same JVM? (as opposed to just
> all tasks of a given job)
>
> If not, what's the recommended paradigm for running multiple instances of a
> job that requires large read-only structures in memory?
>
> Thanks!
> Danny

--
Raja Thiruvathuru