On Mon, Mar 8, 2010 at 2:38 PM, Ryan LeCompte <[email protected]> wrote:
> Hey guys,
>
> Here's a scenario:
>
> Cluster allows a max of 90 mappers and 90 reducers.
>
> 1) Submit a large job, which immediately utilizes all mappers and all
> reducers.
> 2) 10 minutes later, submit a second job. We notice that the cluster will
> eventually allow the mapper portion of both jobs to be shared (so they both
> run concurrently).
>
> HOWEVER... The first job hogs all of the reducers and never "lets go" of
> them so that the other query can have its reducers running.
>
> Any idea how to overcome this? Is there a way to tell Hive or Hadoop to "let
> go" of reducers that are currently running?
>
> Should I limit the max reducers that a single job can use? How?
>
> Thanks,
> Ryan
Ryan,

I think most of this is handled in the Hadoop configuration. You should be able to do:

set mapred.reduce.tasks=5;
query;

Other switches tell Hive how much data each reducer should handle. We are using the fair share scheduler. From reading some JIRAs, I do not think Hadoop supports true preemption yet. I spoke with some Facebookers at Hadoop World NYC who "got around" this (and all their scheduling problems) by running multiple JobTrackers. Of course, that is a major architectural decision.

Edward
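To make the suggestion above concrete, here is a sketch of a Hive CLI session that caps how many reducer slots one query can grab. The numeric values are illustrative only, not tuned recommendations, and `your_table` is a hypothetical table name:

```sql
-- Option 1: hard-code the reducer count for the queries in this session.
SET mapred.reduce.tasks=10;

-- Option 2: let Hive derive the reducer count from input size instead,
-- while capping the total so one job cannot claim all 90 slots.
SET mapred.reduce.tasks=-1;                          -- -1 = let Hive decide
SET hive.exec.reducers.bytes.per.reducer=1000000000; -- ~1 GB of input per reducer
SET hive.exec.reducers.max=45;                       -- at most half the cluster's 90 slots

SELECT dt, COUNT(*) FROM your_table GROUP BY dt;
```

These settings only apply to jobs submitted from the session that sets them; they do not force an already-running job to release its reducers. Guaranteeing each concurrent job a minimum share would instead be done on the scheduler side (for example, per-pool minimums in the fair scheduler's allocations file), subject to the preemption limitations discussed above.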
