Hadoop is not meant for real-time applications. It's more or less designed for long-running applications like crawlers/indexers.
Amar
On Mon, 3 Mar 2008, Spiros Papadimitriou wrote:

Hi

I'd be interested to know if you've tried to use Hadoop for a large number
of short jobs. Perhaps I am missing something, but I've found that the
hardcoded Thread.sleep() calls, especially the 5-second ones in
mapred.ReduceTask (primarily) and mapred.JobClient, cause more of a problem
than the 0.3 sec or so that it takes to fire up a JVM.
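
For concreteness, here's a minimal Java sketch of the pattern I mean and
one possible mitigation; the class and the "poll.interval.ms" key are made
up for illustration, not the actual Hadoop code:

    import java.util.Properties;

    public class PollSketch {
        // The pattern in question: a fixed 5-second sleep between polls,
        // which dominates the runtime of a sub-minute job.
        static final long HARDCODED_SLEEP_MS = 5000;

        // One possible fix: read the interval from the job configuration.
        // "poll.interval.ms" is a made-up key, not a real Hadoop property.
        static long pollIntervalMs(Properties conf) {
            return Long.parseLong(conf.getProperty("poll.interval.ms", "5000"));
        }

        public static void main(String[] args) throws InterruptedException {
            Properties conf = new Properties();
            conf.setProperty("poll.interval.ms", "200"); // tuned for short jobs
            int pollsLeft = 3;                           // stand-in for "job not done"
            while (pollsLeft-- > 0) {
                Thread.sleep(pollIntervalMs(conf));      // was: HARDCODED_SLEEP_MS
                System.out.println("polled job status");
            }
        }
    }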

Agreed that for long-running jobs that is not a concern, but *if* speeding
things up for shorter-running jobs (say, < 1 min) is a goal, then JVM reuse
would seem to be a lower priority? Would doing something about those
sleep()s seem worthwhile?

Thanks,
Spiros

On Sat, Mar 1, 2008 at 4:33 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:


On Mar 1, 2008, at 12:05 PM, Amar Kamat wrote:

3) Lastly, it would seem beneficial for jobs that have significant
startup overhead and memory requirements to not be run in separate
JVMs for each task.  Along these lines, it looks like someone
submitted a patch for JVM reuse a while back, but it wasn't
committed? https://issues.apache.org/jira/browse/HADOOP-249

Most of the ideas in the patch for 249 were committed as other
patches, but that bug has been left open precisely because the idea
still has merit. The patch was never stable enough to commit and now
is hopelessly out of date. There are lots of little issues that would
need to be addressed for this to happen.

Probably a question for the dev mailing list, but if I wanted to
modify Hadoop to allow threading tasks, rather than running
independent JVMs, is there any reason someone hasn't done this
yet?  Or am I overlooking something?
This is done to keep user code separate from the framework code.

Precisely. We don't want to go through the security manager in the
servers, so it is far easier to keep user code out of the servers.

So if the user code develops a fault, the framework and the rest of the
jobs continue to function normally. Most jobs have a long run time, and
hence the startup time is rarely a concern.
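
A rough Java sketch of the isolation this buys, assuming a hypothetical
task main class ("org.example.UserTaskMain" is made up); a fault in the
child JVM leaves the launching daemon running:

    import java.io.IOException;

    public class ChildJvmLauncher {
        // Launch the task's main class in its own JVM. If the user code
        // calls System.exit(), throws OutOfMemoryError, or crashes a
        // native library, only the child process dies; this daemon
        // keeps running and can report a failed task.
        static int runTask(String taskMainClass)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                System.getProperty("java.home") + "/bin/java",
                "-cp", System.getProperty("java.class.path"),
                taskMainClass);
            pb.inheritIO();
            return pb.start().waitFor();  // nonzero exit => failed task
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical user class, for illustration only.
            System.out.println("task exited with "
                + runTask("org.example.UserTaskMain"));
        }
    }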

As long as the tasks belong to the same job (and therefore the same user),
sharing a JVM should be fine. One concern is that currently each task
gets its own working directory. Since Java can't change the working
directory of a running process, a reused JVM would have to clean up the
working directory between tasks. That will interact badly with the
debugging settings that let you keep the task files. However, as we speed
things up, JVM reuse will become more important. Already we are starting
to see sort maps that finish in 17 seconds, which means the 1 second of
JVM startup is roughly a 5% overhead...
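
A minimal sketch of the cleanup a reused JVM would need in place of
chdir(); the class and file names here are illustrative only:

    import java.io.File;
    import java.nio.file.Files;

    public class WorkDirCleaner {
        // Since a running JVM cannot chdir(), a reused JVM must instead
        // wipe the previous task's files out of the shared working
        // directory before the next task starts.
        static void cleanWorkDir(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) return;
            for (File f : entries) {
                if (f.isDirectory()) cleanWorkDir(f);
                f.delete();
            }
        }

        public static void main(String[] args) throws Exception {
            File work = Files.createTempDirectory("taskwork").toFile();
            new File(work, "spill0.out").createNewFile(); // leftover from task 1
            // This is exactly what collides with "keep task files" debug
            // settings: the cleanup deletes the files a user wanted kept.
            cleanWorkDir(work);
            System.out.println("entries left: " + work.list().length);
        }
    }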

-- Owen


