Re: Strange behavior during Hive queries

Todd Lipcon Fri, 11 Sep 2009 14:28:59 -0700

Hrm... sorry, I didn't read your original query closely enough.

I'm not sure what could be causing this. The
mapred.tasktracker.map.tasks.maximum parameter shouldn't affect it at all -
it only sets the number of map slots on each tracker.


By any chance do you have mapred.max.maps.per.node set? This is a
configuration parameter added by HADOOP-5170 - it's not in trunk or the
vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
parameter could cause the behavior you're seeing. However, it would
certainly not default to 2, so I'd be surprised if that were it.
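
If you want to rule it out quickly, here is a minimal sketch (Python, not
anything shipped with Hadoop or Hive) that loads two job.xml files into
dictionaries and prints the properties that differ. It assumes the usual
<configuration><property><name>/<value> layout. The helper names and the two
sample documents are made up for illustration, including the value 2 in the
Hive sample; they are not taken from your cluster.

```python
import xml.etree.ElementTree as ET


def load_job_conf(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf


def diff_conf(a, b):
    """Return {name: (value_in_a, value_in_b)} for properties that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Hypothetical sample data, NOT real job.xml contents.
REGULAR_JOB_XML = """<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
</configuration>"""

HIVE_JOB_XML = """<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.max.maps.per.node</name>
    <value>2</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    regular = load_job_conf(REGULAR_JOB_XML)
    hive = load_job_conf(HIVE_JOB_XML)
    # Print every property whose value differs between the two jobs.
    for name, (left, right) in diff_conf(regular, hive).items():
        print(name, "regular:", left, "hive:", right)
    # Direct lookup of the parameter in question (None if it is not set).
    print("mapred.max.maps.per.node in Hive job:",
          hive.get("mapred.max.maps.per.node"))
```

To use it on real jobs, read the two job.xml files from disk and pass their
contents to load_job_conf in place of the sample strings.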

-Todd

On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:

> Todd -
>
> Of course; it makes sense that it would be that way.  But I'm still left
> wondering why, then, my Hive queries are only using 2 mappers per task
> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
> files from a regular job and a Hive query, and didn't turn up anything -
> though clearly, it has to be something Hive is doing.
>
> Thanks,
> - Brad
>
>
>
> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
>
>> Hi Brad,
>>
>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>> TaskTracker when it starts up. It cannot be changed per-job.
>>
>> Hope that helps
>> -Todd
>>
>>
>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]> wrote:
>>
>>> TIA if anyone can point me in the right direction on this.
>>>
>>> I'm running a simple Hive query (a count on an external table comprising
>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>> mappers spawned on each worker.
>>>
>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>> worker.
>>>
>>> When I do "set -v;" from the Hive command line, I see
>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>
>>> The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum
>>> = 7.
>>>
>>> The only lead I have is that the default for
>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>> using set from the Hive prompt, etc.) and nothing works.  I've combed the
>>> docs & mailing list, but haven't run across the answer.
>>>
>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>> it to use my cluster at full power?
>>>
>>> Many thanks in advance,
>>> - Brad
>>>
>>> --
>>> Brad Heintz
>>> [email protected]
>>>
>>
>>
>
>
> --
> Brad Heintz
> [email protected]
>
