Edward -

Yeah, I figured Hive made some decisions internally about how many mappers
& reducers to use, but this is acting on almost 1TB of data - I don't see
why it would use fewer mappers.  Also, this isn't a sort (which would of
course use only 1 reducer) - it's a straight count.
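
For reference, the query is just a straight count, along the lines of (the
table name here is made up):

  SELECT COUNT(1) FROM my_external_table;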

Thanks,
- Brad

On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[email protected]> wrote:

> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[email protected]> wrote:
> > Hrm... sorry, I didn't read your original query closely enough.
> >
> > I'm not sure what could be causing this. The map.tasks.maximum parameter
> > shouldn't affect it at all - it only affects the number of slots on the
> > trackers.
> >
> > By any chance do you have mapred.max.maps.per.node set? This is a
> > configuration parameter added by HADOOP-5170 - it's not in trunk or the
> > vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release
> > this parameter could cause the behavior you're seeing. However, it would
> > certainly not default to 2, so I'd be surprised if that were it.
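> >
> > If you want to rule it out, something like this should show whether the
> > parameter is set anywhere (assuming $HADOOP_HOME points at your install):
> >
> >   grep -r "mapred.max.maps.per.node" $HADOOP_HOME/conf/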
> >
> > -Todd
> >
> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:
> >>
> >> Todd -
> >>
> >> Of course; it makes sense that it would be that way.  But I'm still left
> >> wondering why, then, my Hive queries are only using 2 mappers per task
> >> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
> >> files from a regular job and a Hive query, and didn't turn up anything -
> >> though clearly, it has to be something Hive is doing.
> >>
> >> Thanks,
> >> - Brad
> >>
> >>
> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
> >>>
> >>> Hi Brad,
> >>>
> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
> >>> TaskTracker when it starts up. It cannot be changed per-job.
> >>>
> >>> Hope that helps
> >>> -Todd
> >>>
> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]>
> >>> wrote:
> >>>>
> >>>> TIA if anyone can point me in the right direction on this.
> >>>>
> >>>> I'm running a simple Hive query (a count on an external table
> >>>> comprising 436 files, each of ~2GB).  The cluster's mapred-site.xml
> >>>> specifies mapred.tasktracker.map.tasks.maximum = 7 - that is, 7
> >>>> mappers per worker node.  When I run regular MR jobs via "bin/hadoop
> >>>> jar myJob.jar...", I see 7 mappers spawned on each worker.
> >>>>
> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
> >>>> worker.
> >>>>
> >>>> When I do "set -v;" from the Hive command line, I see
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The job.xml for the Hive query shows
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The only lead I have is that the default for
> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
> >>>> overridden in the cluster's mapred-site.xml, I've tried redundantly
> >>>> overriding this variable everyplace I can think of (the Hive command
> >>>> line with "-hiveconf", using set from the Hive prompt, and so on) and
> >>>> nothing works.  I've combed the docs & mailing list, but haven't run
> >>>> across the answer.
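> >>>>
> >>>> Concretely, the overrides I've tried look like this (the value 7 just
> >>>> mirrors what's in our mapred-site.xml):
> >>>>
> >>>>   $ bin/hive -hiveconf mapred.tasktracker.map.tasks.maximum=7
> >>>>   hive> set mapred.tasktracker.map.tasks.maximum=7;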
> >>>>
> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this
> >>>> some quirk of Hive, where it decides that 2 mappers per tasktracker
> >>>> is enough, and I should just leave it alone?  Or is there some knob I
> >>>> can fiddle to get it to use my cluster at full power?
> >>>>
> >>>> Many thanks in advance,
> >>>> - Brad
> >>>>
> >>>> --
> >>>> Brad Heintz
> >>>> [email protected]
> >>>
> >>
> >>
> >>
> >> --
> >> Brad Heintz
> >> [email protected]
> >
> >
>
> Hive does adjust some map/reduce settings internally based on the size of
> the job.  Also, some operations, such as a global sort, may require only a
> single reducer to work correctly.
>
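> For what it's worth, the reducer count is the one Hive derives from the
> input size - you can see and adjust the knob from the Hive prompt (the
> value shown is the usual default, but check your version):
>
>   hive> set hive.exec.reducers.bytes.per.reducer=1000000000;
>
> The mapper count, though, ultimately comes from the InputFormat's input
> splits rather than from any tasktracker setting.
>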



-- 
Brad Heintz
[email protected]
