Hi, I'd be interested to know whether you've tried to use Hadoop for a
large number of short jobs. Perhaps I'm missing something, but I've
found that the hardcoded Thread.sleep() calls, especially the 5-second
ones in mapred.ReduceTask (primarily) and mapred.JobClient, cause more
of a problem than the 0.3 s or so it takes to fire up a JVM.
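
To illustrate, the pattern I mean is roughly the following. This is a
sketch, not the actual Hadoop source, and the property name
"mapred.client.poll.interval" is made up here just to show what a
configurable interval could look like:

    // Sketch only -- not the actual Hadoop code. The property name
    // "mapred.client.poll.interval" is invented for illustration.
    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PollSketch {
        static void waitForCompletion(JobConf conf, RunningJob job)
                throws IOException, InterruptedException {
            // Today this is effectively a hardcoded Thread.sleep(5000),
            // which puts a multi-second floor under every short job.
            long intervalMs =
                conf.getLong("mapred.client.poll.interval", 5000L);
            while (!job.isComplete()) {
                Thread.sleep(intervalMs);
            }
        }
    }

Even just making the interval configurable (defaulting to the current
5 s) would let people running short jobs opt into sub-second polling
without changing the behavior for anyone else.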
Agreed that for long-running jobs that is not a concern, but *if*
speeding things up for shorter-running jobs (say < 1 min) is a goal,
then JVM reuse would seem to be a lower priority? Would doing something
about those sleep()s seem worthwhile?

Thanks,
Spiros

On Sat, Mar 1, 2008 at 4:33 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Mar 1, 2008, at 12:05 PM, Amar Kamat wrote:
>
> >> 3) Lastly, it would seem beneficial for jobs that have significant
> >> startup overhead and memory requirements to not be run in separate
> >> JVMs for each task. Along these lines, it looks like someone
> >> submitted a patch for JVM reuse a while back, but it wasn't
> >> committed? https://issues.apache.org/jira/browse/HADOOP-249
>
> Most of the ideas in the patch for 249 were committed as other
> patches, but that bug has been left open precisely because the idea
> still has merit. The patch was never stable enough to commit and is
> now hopelessly out of date. There are lots of little issues that
> would need to be addressed for this to happen.
>
> >> Probably a question for the dev mailing list, but if I wanted to
> >> modify Hadoop to allow threading tasks, rather than running
> >> independent JVMs, is there any reason someone hasn't done this
> >> yet? Or am I overlooking something?
>
> > This is done to keep user code separate from the framework code.
>
> Precisely. We don't want to go through the security manager in the
> servers, so it is far easier to keep user code out of the servers.
>
> > So if the user code develops a fault, the framework and the rest
> > of the jobs function normally. Most jobs have a longer run time,
> > and hence the startup time is never a concern.
>
> As long as the tasks belong to the same job (and therefore user),
> sharing a JVM should be fine. One concern is that currently each
> task gets its own working directory. Since Java can't change the
> working directory of a running process, it would have to clean up
> the working directory instead. That will interact badly with the
> debugging settings that let you keep the task files. However, as we
> speed things up, this will become more important. Already we are
> starting to see sort maps that finish in 17 seconds, at which point
> the 1 second of JVM startup is roughly a 5% overhead...
>
> -- Owen
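
For concreteness, here is a sketch of the working-directory point Owen
raises. The class and method names below are invented for illustration,
not anything in the Hadoop tree:

    // Sketch only -- names invented for illustration. The JVM has no
    // chdir(): System.setProperty("user.dir", ...) only changes what
    // File.getAbsolutePath() reports, while the OS still resolves
    // relative paths against the original process cwd. So a reused JVM
    // must either clean the shared working directory between tasks...
    import java.io.File;

    public class WorkDirSketch {
        static void cleanWorkDir(File workDir) {
            File[] entries = workDir.listFiles();
            if (entries == null) return;  // not a directory, or I/O error
            for (File f : entries) {
                deleteRecursively(f);
            }
        }

        static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File c : children) {
                    deleteRecursively(c);
                }
            }
            f.delete();
        }

        // ...or give each task its own explicit directory and resolve
        // every relative path against it, never trusting the process cwd.
        static File inTaskDir(File taskDir, String relativePath) {
            return new File(taskDir, relativePath);
        }
    }

The cleanup path is the one that collides with the debugging option
Owen mentions for keeping task files around; resolving against an
explicit per-task directory avoids that, but only for code that goes
through the framework rather than opening relative paths directly.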
