On Mar 1, 2008, at 12:05 PM, Amar Kamat wrote:
>> 3) Lastly, it would seem beneficial for jobs that have significant
>> startup overhead and memory requirements not to be run in separate
>> JVMs for each task. Along these lines, it looks like someone
>> submitted a patch for JVM reuse a while back, but it wasn't
>> committed? https://issues.apache.org/jira/browse/HADOOP-249

Most of the ideas in the patch for HADOOP-249 were committed as other
patches, but the issue has been left open precisely because the idea
still has merit. The patch was never stable enough to commit, and by now
it is hopelessly out of date. There are lots of little issues that would
need to be addressed before JVM reuse could happen.
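
For what it's worth, if reuse does land, the natural shape is probably a
per-job knob in the JobConf, so that jobs with heavy task setup can opt
in. A minimal sketch of what that could look like (the property name
here is purely illustrative, not an existing knob):

  import org.apache.hadoop.mapred.JobConf;

  public class ReuseSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(ReuseSketch.class);
      // Illustrative property name; no such knob exists today.
      // The idea: how many tasks of this job one task JVM may run
      // before it is torn down. 1 would be the current
      // one-JVM-per-task behavior; -1 could mean "no limit".
      conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    }
  }
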
>> Probably a question for the dev mailing list, but if I wanted to
>> modify Hadoop to allow threading tasks, rather than running
>> independent JVMs, is there any reason someone hasn't done this
>> yet? Or am I overlooking something?

> This is done to keep user code separate from the framework code.

Precisely. We don't want to go through the security manager in the
servers, so it is far easier to keep user code out of the servers.
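
To make that concrete, the isolation we get from child JVMs looks
roughly like this (a sketch, not the actual TaskTracker code; the class
and method names are made up). The server only ever sees an exit code,
so a System.exit(), an OutOfMemoryError, or a JNI crash in user code
takes down the child, never the server:

  import java.io.File;

  // Sketch only; not the actual TaskTracker code.
  public class ChildJvmLauncher {
    // Run one task in its own JVM; the server just watches the exit code.
    public static int runTask(String taskMainClass, File workDir)
        throws Exception {
      ProcessBuilder pb = new ProcessBuilder(
          "java", "-cp", System.getProperty("java.class.path"),
          taskMainClass);
      pb.directory(workDir);         // per-task working directory
      pb.redirectErrorStream(true);  // fold stderr into stdout for logs
      Process child = pb.start();
      return child.waitFor();        // nonzero exit: mark the task failed
    }
  }
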
> So if the user code develops a fault, the framework and the rest of
> the jobs function normally. Most jobs have a long run time, and hence
> the startup time is never a concern.

As long as the tasks belong to the same job (and therefore the same
user), sharing a JVM should be fine. One concern is that currently each
task gets its own working directory. Since Java can't change the working
directory of a running process, a shared JVM would have to clean up the
working directory between tasks. That will interact badly with the
debugging settings that let you keep the task files. However, as we
speed things up, reuse will become more important. We are already
starting to see sort maps that finish in 17 seconds, which means the 1
second of JVM startup is roughly a 6% overhead...
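
For illustration, here is the kind of bookkeeping a shared JVM would
need for the working-directory problem (a sketch with made-up names,
not Hadoop API). Since the process cannot chdir, each task gets an
absolute per-task directory to resolve its paths against, and the
runner deletes it between tasks unless the setting that keeps task
files is on:

  import java.io.File;

  // Sketch only; illustrative names, not Hadoop API.
  public class TaskWorkDirs {
    // Create an absolute per-task directory; tasks must resolve all
    // relative paths against it, since the JVM itself cannot chdir.
    public static File setUp(File jobDir, String taskId) {
      File workDir = new File(jobDir, taskId);
      workDir.mkdirs();
      return workDir;
    }

    // Remove the directory between tasks, unless the debugging
    // setting that keeps task files around is enabled.
    public static void tearDown(File workDir, boolean keepTaskFiles) {
      if (!keepTaskFiles) {
        deleteRecursively(workDir);
      }
    }

    private static void deleteRecursively(File f) {
      File[] children = f.listFiles();
      if (children != null) {
        for (File c : children) {
          deleteRecursively(c);
        }
      }
      f.delete();
    }
  }
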
-- Owen