Ahhh.... My previous comments assumed that "long-lived" meant jobs that run for days and days and days (essentially forever).
15-minute jobs with a finite work-list are actually a pretty good match for
map/reduce as implemented by Hadoop.

On 12/25/07 10:04 AM, "Kirk True" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Thanks for all the replies thus far...
>
> Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote: in many cases, long-running
> tasks are of low CPU util. I have trouble imagining how these can mix well
> with CPU-intensive short/batch tasks. AFAIK, Hadoop's job scheduling is not
> resource-usage aware. Long background tasks would consume per-machine task
> slots that would block out other tasks from using available CPU bandwidth.
>
> Maybe I should clarify things...
>
> The jobs that we're presently trying to use Hadoop for are fairly
> long-lived (i.e. ~15 minutes) but -- to Chad's point -- they are finite.
>
> That said, the long-livedness of the individual jobs is a "temporary"
> thing, in that we'll be making each job do less work such that they run
> for ~1 minute.
>
> To John's point, yes, the question is not "is this optimal?" but "is it
> reasonable to use a framework geared for map/reduce operations to simply
> distribute jobs over multiple machines?"
>
> I've looked at a couple of other solutions for generic master/worker
> functionality, but we'd like to stick to an open source implementation.
>
> Like I said before, I can get Hadoop to do what I need. But that doesn't
> make it "right" ;)
>
> Thanks,
> Kirk
>
> -----Original Message-----
> From: Chad Walters [mailto:[EMAIL PROTECTED]
> Sent: Sat 12/22/2007 2:39 PM
> To: [email protected]
> Subject: Re: Appropriate use of Hadoop for non-map/reduce tasks?
>
> I should further say that god functions only on a per-machine basis. We
> have then built a number of scripts that do auto-configuration of our
> various services, using configs pulled from LDAP and code pulled from our
> package repo. We use this to configure our various server processes and
> also to configure Hadoop clusters (HDFS and Map/Reduce). But god is a key
> part of the system, since it helps us provide a uniform interface for
> starting and stopping all our services.
>
> Chad
>
>
> On 12/22/07 1:30 PM, "Chad Walters" wrote:
>
> I am not really sure that Hadoop is right for what Jeff is describing.
>
> I think there may be two separate problems:
>
> 1. Batch tasks that may take a long time but are expected to have a finite
>    termination
> 2. Long-lived server processes that have an indefinite lifetime
>
> For #1, we pretty much use Hadoop, although we have built a fairly
> extensive framework inside of these long map tasks to track progress and
> handle various failure conditions that can arise. If people are really
> interested, I'll poke around and see if any of it is general enough to
> warrant contributing back, but I think a lot of it is probably fairly
> specific to the kinds of failure cases we expect from the components
> involved in the long map task.
>
> For #2, we are using something called "god" (http://god.rubyforge.org/).
> One of our developers ended up starting this project because he didn't
> like monit. We liked the way it was going, and now we use it throughout
> our datacenter to start, stop, and health-check our server processes. It
> supports both polling and event-driven actions and is pretty extensible.
> Check it out to see if it might satisfy some of your needs.
>
> Chad
>
>
> On 12/22/07 11:40 AM, "Jeff Hammerbacher" wrote:
>
> Yo,
>
> From my understanding, the map/reduce codebase grew out of the codebase
> for "the Borg", Google's system for managing long-running processes. We
> could definitely use this sort of functionality, and the
> JobTracker/TaskTracker paradigm goes part of the way there. SQS really
> helps when you want to run a set of recurring, dependent processes (a
> problem our group definitely needs to solve), but it doesn't really seem
> to address the issue of managing those processes when they're long-lived.
>
> For instance, when we deploy our search servers, we have a script that
> basically says "daemonize this process on this many boxes, and if it
> enters a condition that doesn't look healthy, take this action (like
> restart, or rebuild the index, etc.)". Given how hard-coded the task type
> is into map/reduce (er, "map" and "reduce"), it's hard to specify new
> types of error conditions and running conditions for your processes.
> Also, the JobTracker doesn't have any high-availability guarantees, so
> you could run into a situation where your processes are fine but the
> JobTracker goes down. ZooKeeper could help here. It'd be sweet if Hadoop
> could handle this long-lived process management scenario.
>
> Kirk, I'd be interested in hearing more about your processes and the
> requirements you have of your process manager. We're exploring other
> solutions to this problem and I'd be happy to connect you with the folks
> here who are thinking about the issue.
>
> Later,
> Jeff
>
> On Dec 21, 2007 12:42 PM, John Heidemann wrote:
>
>> On Fri, 21 Dec 2007 12:24:57 PST, John Heidemann wrote:
>>> On Thu, 20 Dec 2007 18:46:58 PST, Kirk True wrote:
>>>> Hi all,
>>>>
>>>> A lot of the ideas I have for incorporating Hadoop into internal
>>>> projects revolve around distributing long-running tasks over multiple
>>>> machines. I've been able to get a quick prototype up in Hadoop for one
>>>> of those projects and it seems to work pretty well.
>>>> ...
>>>
>>> He's not saying "is Hadoop optimal" for things that aren't really
>>> map/reduce, but "is it reasonable" for those things?
>>> (Kirk, is that right?)
>>> ...
>>
>> Sorry to double-reply, but I left out my comment on (my view of) Kirk's
>> question.
>>
>> In addition to what Ted said, I'm not sure how well Hadoop works with
>> long-running jobs, particularly how well that interacts with its
>> fault-tolerance code.
>>
>> And more generally, if you're not doing map/reduce then you'd probably
>> have to build your own fault-tolerance methods.
>>
>> -John Heidemann
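
To make the pattern under discussion concrete: distributing a finite work-list
"via map/reduce" usually means a map-only job, roughly as sketched below. This
is not code from anyone on the thread; it is a minimal example against the old
org.apache.hadoop.mapred API (the NLineInputFormat it uses appeared in
slightly later 0.x releases than the one current at the time), and
WorkListJob, runWorkItem(), and the jar/path names are made-up placeholders.
Each input line names one work item, there is no reduce phase, and
reporter.progress() is what keeps the framework's task timeout from killing a
long-running map attempt, which is the fault-tolerance interaction John asks
about.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// Hypothetical map-only job: each line of the input file names one work item,
// each map task runs one item, and there is no reduce phase at all.
public class WorkListJob {

  public static class WorkMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String item = line.toString().trim();
      if (item.length() == 0) {
        return;
      }
      reporter.setStatus("running " + item);

      // Placeholder for the real ~15-minute unit of work. Anything that runs
      // long should call reporter.progress() periodically so the framework's
      // task timeout (mapred.task.timeout) doesn't treat the attempt as hung
      // and reschedule it.
      String result = runWorkItem(item, reporter);

      output.collect(new Text(item), new Text(result));
    }

    private String runWorkItem(String item, Reporter reporter) {
      reporter.progress();
      return "done";
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WorkListJob.class);
    conf.setJobName("work-list");

    conf.setMapperClass(WorkMapper.class);
    conf.setNumReduceTasks(0);            // map-only: no shuffle, no reduce

    // One input line per map task, so each work item gets its own task slot.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // the work list
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // per-item results

    JobClient.runJob(conf);   // blocks until every work item has run (or failed)
  }
}

Run with something like "hadoop jar worklist.jar WorkListJob /jobs/items.txt
/jobs/results" (placeholder names). A work list in HDFS then fans out across
whatever task slots the cluster has free, and failed items are retried by the
normal task re-execution machinery, which is roughly the "reasonable but not
optimal" trade-off the thread is weighing.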
