>>I've looked at a couple of other solutions for generic master/worker
>>type of functionality, but we'd like to stick to an open source
>>implementation.

Have you folks looked at Torque/Maui for cluster resource scheduling?

--Venkat

-----Original Message-----
From: Kirk True [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 25, 2007 10:04 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Appropriate use of Hadoop for non-map/reduce tasks?

Hi all,

Thanks for all the replies thus far...

Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote: in many cases, long-running
tasks have low CPU utilization. I have trouble imagining how these can mix
well with CPU-intensive short/batch tasks. AFAIK, Hadoop's job scheduling
is not resource-usage aware, so long background tasks would consume
per-machine task slots and block other tasks from using available CPU
bandwidth.

Maybe I should clarify things...

The jobs that we're presently trying to use Hadoop for are fairly
long-lived (i.e., ~15 minutes), but -- to Chad's point -- they are finite.

That said, the long-livedness of the individual jobs is a "temporary"
thing: we'll be making each job do less work so that each one runs in
roughly a minute.

To John's point, yes: the question is not 'is this optimal?' but 'is it
reasonable to use a framework geared for map/reduce operations simply to
distribute jobs over multiple machines?'
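
Concretely, what I have in mind is just a map-only job where each input
record names a unit of work and there is no reduce step at all. Roughly
the sketch below (old org.apache.hadoop.mapred API; this isn't my actual
code, the class name and runTask() are made up, and the input-path call
varies a bit between Hadoop versions):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class DistributedWorker extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  // Each input line is treated as a work item, not as data to transform.
  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> output,
                  Reporter reporter) throws IOException {
    runTask(value.toString());
  }

  private void runTask(String workItem) {
    // the actual long-ish task would go here
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(DistributedWorker.class);
    conf.setJobName("generic-worker");
    conf.setMapperClass(DistributedWorker.class);
    conf.setNumReduceTasks(0);                    // map-only: no reduce phase
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class); // nothing to write out
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);                       // submit and wait
  }
}

The input file would just list the work items, one per line, and the
framework takes care of farming them out to map tasks across the cluster.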

I've looked at a couple of other solutions for generic master/worker
type of functionality, but we'd like to stick to an open source
implementation. 

Like I said before, I can get Hadoop to do what I need. But that doesn't
make it "right" ;)

Thanks,
Kirk

-----Original Message-----
From: Chad Walters [mailto:[EMAIL PROTECTED]
Sent: Sat 12/22/2007 2:39 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Appropriate use of Hadoop for non-map/reduce tasks?
 

I should further say that god functions only on a per-machine basis. On
top of it, we have built a number of scripts that do auto-configuration of
our various services, using configs pulled from LDAP and code pulled from
our package repo. We use this to configure our various server processes
and also to configure Hadoop clusters (HDFS and Map/Reduce). But god is a
key part of the system, since it gives us a uniform interface for starting
and stopping all our services.

Chad


On 12/22/07 1:30 PM, "Chad Walters"  wrote:

I am not really sure that Hadoop is right for what Jeff is describing.

I think there may be two separate problems:

 1.  Batch tasks that may take a long time but are expected to terminate
 2.  Long-lived server processes that have an indefinite lifetime

For #1, we pretty much use Hadoop, although we have built a fairly
extensive framework inside of these long map tasks to track progress and
handle various failure conditions that can arise. If people are really
interested, I'll poke around and see if any of it is general enough to
warrant contributing back, but I think a lot of it is probably fairly
specific to the kinds of failure cases we expect from the components
involved in the long map task.
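
One mechanical piece of it that is easy to show: a long map task has to
keep reporting progress so the task timeout (mapred.task.timeout, ten
minutes by default) doesn't declare it dead. The snippet below is not our
framework, just a minimal sketch of that one part; the chunking and
doChunkOfWork() are invented for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LongTaskMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> output,
                  Reporter reporter) throws IOException {
    int totalChunks = 100;                // pretend the work splits into chunks
    for (int i = 0; i < totalChunks; i++) {
      doChunkOfWork(value.toString(), i); // real work goes here
      reporter.progress();                // heartbeat so the task isn't killed
      reporter.setStatus("chunk " + (i + 1) + " of " + totalChunks);
    }
  }

  private void doChunkOfWork(String workItem, int chunk) {
    // per-chunk work; per-chunk failure handling and retries would live here
  }
}

The timeout itself can also be raised per job via mapred.task.timeout if
need be.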

For #2, we are using something called "god" (http://god.rubyforge.org/).
One of our developers ended up starting this project because he didn't
like monit. We liked the way it was going, and now we use it throughout
our datacenter to start, stop, and health-check our server processes. It
supports both polling and event-driven actions and is pretty extensible.
Check it out to see if it might satisfy some of your needs.

Chad


On 12/22/07 11:40 AM, "Jeff Hammerbacher"  wrote:

yo,
from my understanding, the map/reduce codebase grew out of the codebase
for "the borg", google's system for managing long-running processes. we
could definitely use this sort of functionality, and the
jobtracker/tasktracker paradigm goes part of the way there. sqs really
helps when you want to run a set of recurring, dependent processes (a
problem our group definitely needs to solve), but it doesn't really seem
to address the issue of managing those processes when they're long-lived.

for instance, when we deploy our search servers, we have a script that
basically says "daemonize this process on this many boxes, and if it
enters a condition that doesn't look healthy, take this action (like
restart, or rebuild the index, etc.)". given how hard-coded the task-type
is into map/reduce (er, "map" and "reduce"), it's hard to specify new
types of error conditions and running conditions for your processes.
also, the jobtracker doesn't have any high availability guarantees, so
you could run into a situation where your processes are fine but the
jobtracker goes down. zookeeper could help here. it'd be sweet if hadoop
could handle this long-lived process management scenario.
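
for the curious, the per-box piece of what i mean is conceptually just a
watchdog loop, something like the bare-bones java sketch below (not our
actual script; the server path, health url, poll interval, and
restart-only policy are all made up):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchServerWatchdog {

  public static void main(String[] args) throws Exception {
    ProcessBuilder launcher =
        new ProcessBuilder("/opt/search/bin/searchserver"); // hypothetical path
    Process server = launcher.start();

    while (true) {
      Thread.sleep(30000);                                  // poll every 30s
      if (!isRunning(server) || !healthy("http://localhost:8080/ping")) {
        server.destroy();              // the "action" could also rebuild the
        server = launcher.start();     // index, page someone, etc.
      }
    }
  }

  private static boolean isRunning(Process p) {
    try {
      p.exitValue();                   // throws if the process is still alive
      return false;
    } catch (IllegalThreadStateException e) {
      return true;
    }
  }

  private static boolean healthy(String url) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      return conn.getResponseCode() == 200;
    } catch (IOException e) {
      return false;
    }
  }
}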

kirk, i'd be interested in hearing more about your processes and the
requirements you have of your process manager.  we're exploring other
solutions to this problem and i'd be happy to connect you with the folks
here who are thinking about the issue.

later,
jeff

On Dec 21, 2007 12:42 PM, John Heidemann  wrote:

> On Fri, 21 Dec 2007 12:24:57 PST, John Heidemann wrote:
> >On Thu, 20 Dec 2007 18:46:58 PST, Kirk True wrote:
> >>Hi all,
> >>
> >>A lot of the ideas I have for incorporating Hadoop into internal
> >>projects revolve around distributing long-running tasks over multiple
> >>machines. I've been able to get a quick prototype up in Hadoop for one
> >>of those projects and it seems to work pretty well.
> >>...
> >He's not saying "is Hadoop optimal" for things that aren't really
> >map/reduce, but "is it reasonable" for those things?
> >(Kirk, is that right?)
> >...
>
> Sorry to double reply, but I left out my comment to (my view of) Kirk's
> question.
>
> In addition to what Ted said, I'm not sure how well Hadoop works with
> long-running jobs, particularly how well that interacts with its fault
> tolerance code.
>
> And more generally, if you're not doing map/reduce then you'd probably
> have to build your own fault tolerance methods.
>
>   -John Heidemann
>
>