On Mar 1, 2008, at 12:05 PM, Amar Kamat wrote:

3) Lastly, it would seem beneficial for jobs that have significant startup overhead and memory requirements to not be run in separate JVMs for each task. Along these lines, it looks like someone submitted a patch for JVM reuse a while back, but it wasn't committed? https://issues.apache.org/jira/browse/HADOOP-249

Most of the ideas in the patch for 249 were committed as other patches, but that bug has been left open precisely because the idea still has merit. The patch was never stable enough to commit and now is hopelessly out of date. There are lots of little issues that would need to be addressed for this to happen.

Probably a question for the dev mailing list, but if I wanted to modify Hadoop to run tasks as threads rather than in independent JVMs, is there any reason someone hasn't done this yet? Or am I overlooking something?
This is done to keep user code separate from the framework code.

Precisely. We don't want to go through the security manager in the servers, so it is far easier to keep user code out of the servers.
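To make the isolation concrete: the framework forks a fresh JVM per task, so user code never loads into the server process at all. A minimal sketch of that launch pattern (the class name, method, and arguments here are illustrative, not the actual TaskTracker code):

```java
import java.io.File;
import java.io.IOException;

public class ChildJvmLauncher {
    // Hypothetical sketch: fork a separate JVM for a task so that
    // user code runs (and can crash) outside the server process.
    public static Process launchTask(String taskClass, File workDir)
            throws IOException {
        String javaBin = new File(
                new File(System.getProperty("java.home"), "bin"),
                "java").getPath();
        ProcessBuilder pb = new ProcessBuilder(
                javaBin,
                "-classpath", System.getProperty("java.class.path"),
                taskClass);
        pb.directory(workDir);          // each child gets its own cwd
        pb.redirectErrorStream(true);   // merge stdout/stderr for logging
        return pb.start();
    }
}
```

Because the child is a separate OS process, a misbehaving task can only kill itself; no SecurityManager policy is needed in the server.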

So if the user code develops a fault, the framework and the rest of the jobs keep functioning normally. Most jobs run long enough that the per-task startup time is rarely a concern.

As long as the tasks belong to the same job (and therefore user), sharing a JVM should be fine. One concern is that currently each task gets its own working directory. Since Java can't change the working directory of a running process, a reused JVM would instead have to clean up the working directory between tasks. That will interact badly with the debugging settings that let you keep the task files. However, as we speed things up, it will become more important. Already we are starting to see sort maps that finish in 17 seconds, which means the one second of JVM startup is roughly a 6% overhead...
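Since Java offers no chdir, the cleanup approach above amounts to keeping one fixed working directory per task slot and wiping its contents before the next task runs in the reused JVM. A minimal sketch, with a hypothetical helper name:

```java
import java.io.File;

public class WorkDirCleaner {
    // Recursively delete everything *under* dir, but keep dir itself,
    // so the reused JVM's fixed working directory is empty for the
    // next task. (This is exactly what would clash with debug settings
    // that preserve task files.)
    public static void cleanContents(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or an I/O error listing it
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                cleanContents(entry);
            }
            entry.delete(); // deletes files, and dirs once emptied
        }
    }
}
```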

-- Owen
