On 4/29/2010 10:52 AM, Aaron Kimball wrote:
* JVM reuse only applies within the same job; different jobs always
run in separate JVMs.
* JVM reuse is serial; you'll only get task B in a JVM after task A
has already completed -- never both at the same time. If you configure
Hadoop to run 4 tasks per TaskTracker concurrently, you'll still have
4 JVMs up.
There isn't an easy way to get this large data structure of yours
shared between tasks or jobs using Hadoop's own mechanisms. Instead,
you need to look to an external system. For example, populate memcached
with your hashmap ahead of time and have all your jobs query that. Or
use a BDB or TokyoCabinet or some other disk-backed key-value database
and have each task demand-load only the (k, v) pairs that are relevant
to that particular task.
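For illustration, the demand-load approach might look roughly like the
sketch below. KeyValueStore is a made-up placeholder for whatever
disk-backed store or cache client you actually pick; only the Hadoop
Mapper API is real here:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DemandLoadMapper extends Mapper<LongWritable, Text, Text, Text> {

      // Hypothetical wrapper around the external store (BDB, TokyoCabinet,
      // memcached, ...); not a real library class.
      private KeyValueStore store;

      @Override
      protected void setup(Context context) throws IOException {
        // Open a read-only handle to the store. The location could be shipped
        // to each node via the DistributedCache or configured per cluster.
        String location = context.getConfiguration().get("lookup.store.location");
        store = KeyValueStore.openReadOnly(location);
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Fetch only the entries this record needs, rather than holding
        // the entire HashMap in every task's heap.
        String key = line.toString().split("\t")[0];
        String value = store.get(key);
        if (value != null) {
          context.write(new Text(key), new Text(value));
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        store.close();
      }
    }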
- Aaron
On Thu, Apr 29, 2010 at 8:43 AM, David Rosenstrauch <dar...@darose.net> wrote:
On 04/29/2010 11:08 AM, Danny Leshem wrote:
David,

DistributedCache distributes files across the cluster - it is not a
shared memory cache.

My problem is not distributing the HashMap across machines, but the
fact that it is replicated in memory for each task (or each job, for
that matter).
OK, sorry for the misunderstanding.
Hmmm ... well ... I thought there was a config parm to control whether
the task gets launched in a new VM, but I can't seem to find it. A
quick look at the list of map/reduce parms turned up this, though:

mapred.job.reuse.jvm.num.tasks (default: 1) -- How many tasks to run
per JVM. If set to -1, there is no limit.
Perhaps that might help? I'm speculating here, but in theory, if you
set it to -1, then all task attempts per job per node would run in the
same VM ... and so have access to the same static variables.
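Roughly, that speculative approach could look like the sketch below;
loadLookupTable() is just a stand-in for however the HashMap actually
gets built (e.g. from a file shipped via the DistributedCache):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SharedLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

      // Populated at most once per JVM. With JVM reuse enabled, later tasks
      // of the same job on the same node see the already-loaded map.
      private static Map<String, String> lookup;

      private static synchronized void ensureLoaded(Configuration conf)
          throws IOException {
        if (lookup == null) {
          Map<String, String> m = new HashMap<String, String>();
          loadLookupTable(conf, m);  // stand-in: fill the map from wherever
          lookup = m;
        }
      }

      @Override
      protected void setup(Context context) throws IOException {
        ensureLoaded(context.getConfiguration());
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String value = lookup.get(line.toString());
        if (value != null) {
          context.write(line, new Text(value));
        }
      }

      private static void loadLookupTable(Configuration conf,
          Map<String, String> map) throws IOException {
        // ... read the (k, v) pairs from a local file or the DistributedCache ...
      }
    }

The driver would also need conf.setInt("mapred.job.reuse.jvm.num.tasks", -1)
(or some value greater than 1) before submitting the job; and per Aaron's
point above, each of the N concurrent task slots on a node still gets its
own JVM and therefore its own copy of the map.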
You might also want to poke around in the full list of map/reduce
config parms and see if there's anything else in there that might
help solve this:
http://hadoop.apache.org/common/docs/current/mapred-default.html
HTH,
DR
Couldn't the DistributedCache idea still work with a chained set of
jobs? Map the first set into files on the DFS and add them to the DC
for the next time through?
Nick Jones
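For what it's worth, a rough driver-side sketch of that chaining idea
follows; the job names, paths, and the "part-" filter are illustrative
only, not from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedCacheDriver {
      public static void main(String[] args) throws Exception {
        // Job 1: build the lookup data and write it to a DFS directory.
        Configuration conf1 = new Configuration();
        Path lookupDir = new Path("/tmp/lookup-data");  // illustrative path
        Job first = new Job(conf1, "build lookup data");
        // ... set mapper, reducer, input path, output types, etc. ...
        FileOutputFormat.setOutputPath(first, lookupDir);
        first.waitForCompletion(true);

        // Job 2: ship job 1's output files to every node via the DistributedCache.
        Configuration conf2 = new Configuration();
        FileSystem fs = FileSystem.get(conf2);
        for (FileStatus stat : fs.listStatus(lookupDir)) {
          if (stat.getPath().getName().startsWith("part-")) {
            DistributedCache.addCacheFile(stat.getPath().toUri(), conf2);
          }
        }
        Job second = new Job(conf2, "use lookup data");
        // ... the second job's tasks open the local cached copies in setup() ...
        second.waitForCompletion(true);
      }
    }

Note that the DistributedCache only localizes the files to each node's
disk; each task would still load whatever it reads from them into its
own memory unless this is combined with something like the JVM-reuse or
demand-load ideas above.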