I am not really sure that Hadoop is right for what Jeff is describing. I think there may be two separate problems:
1. Batch tasks that may take a long time but are expected to have a finite termination.
2. Long-lived server processes that have an indefinite lifetime.

For #1, we pretty much use Hadoop, although we have built a fairly extensive framework inside of these long map tasks to track progress and handle the various failure conditions that can arise. If people are really interested, I'll poke around and see if any of it is general enough to warrant contributing back, but I suspect a lot of it is fairly specific to the kinds of failure cases we expect from the components involved in the long map task.

For #2, we are using something called "god" (http://god.rubyforge.org/). One of our developers started the project because he didn't like monit. We liked the way it was going, and now we use it throughout our datacenter to start, stop, and health-check our server processes. It supports both polling and event-driven actions and is pretty extensible. Check it out to see if it might satisfy some of your needs.

Chad

On 12/22/07 11:40 AM, "Jeff Hammerbacher" <[EMAIL PROTECTED]> wrote:

yo,

from my understanding, the map/reduce codebase grew out of the codebase for "the borg", google's system for managing long-running processes. we could definitely use this sort of functionality, and the jobtracker/tasktracker paradigm goes part of the way there. sqs really helps when you want to run a set of recurring, dependent processes (a problem our group definitely needs to solve), but it doesn't really address the issue of managing those processes once they're long-lived.

for instance, when we deploy our search servers, we have a script that basically says "daemonize this process on this many boxes, and if it enters a condition that doesn't look healthy, take this action (like restart, or rebuild the index, etc.)". given how hard-coded the task types are in map/reduce (er, "map" and "reduce"), it's hard to specify new types of error conditions and running conditions for your processes. also, the jobtracker doesn't have any high-availability guarantees, so you could run into a situation where your processes are fine but the jobtracker goes down. zookeeper could help here. it'd be sweet if hadoop could handle this long-lived process management scenario.

kirk, i'd be interested in hearing more about your processes and the requirements you have of your process manager. we're exploring other solutions to this problem and i'd be happy to connect you with the folks here who are thinking about the issue.

later,
jeff
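The "daemonize this process, and if it doesn't look healthy, take this action" behaviour Jeff describes, and the polling mode Chad mentions in god, are both essentially a watch/probe/restart loop. A minimal, hypothetical sketch of that loop in Java follows; it is not god's configuration format (god itself is configured with a Ruby DSL), and the start command and health probe below are made up:

import java.io.IOException;

public class Watchdog {

    // Made-up start command for the managed server process.
    private static final String START_CMD = "/usr/local/bin/search-server";
    private static final long POLL_INTERVAL_MS = 30000L;

    public static void main(String[] args) throws Exception {
        Process server = start();
        while (true) {
            Thread.sleep(POLL_INTERVAL_MS);
            if (!server.isAlive() || !healthy()) {
                // "take this action" -- here the action is a restart, but it
                // could just as well rebuild an index or notify someone.
                server.destroy();
                server.waitFor();
                server = start();
            }
        }
    }

    private static Process start() throws IOException {
        return new ProcessBuilder(START_CMD).inheritIO().start();
    }

    private static boolean healthy() {
        // Placeholder probe; a real check might hit an HTTP status page
        // or run a canned query against the search index.
        return true;
    }
}

As I understand it, god's event-driven mode gets process-exit events from the kernel instead of sleeping and polling, but the action side (restart, rebuild, notify) is the same idea.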
On Dec 21, 2007 12:42 PM, John Heidemann <[EMAIL PROTECTED]> wrote:
> On Fri, 21 Dec 2007 12:24:57 PST, John Heidemann wrote:
> >On Thu, 20 Dec 2007 18:46:58 PST, Kirk True wrote:
> >>Hi all,
> >>
> >>A lot of the ideas I have for incorporating Hadoop into internal
> >>projects revolve around distributing long-running tasks over multiple
> >>machines. I've been able to get a quick prototype up in Hadoop for one
> >>of those projects and it seems to work pretty well.
> >>...
> >He's not asking "is Hadoop optimal" for things that aren't really
> >map/reduce, but "is it reasonable" for those things?
> >(Kirk, is that right?)
> >...
>
> Sorry to double reply, but I left out my comment on (my view of) Kirk's
> question.
>
> In addition to what Ted said, I'm not sure how well Hadoop works with
> long-running jobs, particularly how well that interacts with its fault
> tolerance code.
>
> And more generally, if you're not doing map/reduce then you'd probably
> have to build your own fault tolerance methods.
>
>   -John Heidemann
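On John's concern about long-running jobs and the fault tolerance code: Hadoop fails a task that goes too long without reporting progress (mapred.task.timeout, ten minutes by default), so a framework like the one Chad describes has to keep calling Reporter.progress() from inside the long map task. A minimal sketch against the org.apache.hadoop.mapred API current at the time of this thread; doOneStep() is a hypothetical stand-in for the real work:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LongTaskMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Work through whatever long-running job this record describes,
        // telling the framework we are alive after every chunk so the
        // task isn't failed for inactivity.
        while (!doOneStep(key, value)) {
            reporter.setStatus("still working on " + key);
            reporter.progress();   // resets the task's inactivity timer
        }
        output.collect(key, new Text("done"));
    }

    // Hypothetical unit of work; returns true when this record's job is
    // finished. The real version might build an index, copy data, etc.
    private boolean doOneStep(Text key, Text value) throws IOException {
        return true;
    }
}

Failures the task can't handle itself just propagate as exceptions and fall back on Hadoop's normal task retry; anything smarter still has to live inside the map task, which is the kind of component-specific handling Chad mentions.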