No - 2 mappers per node, 7 nodes = 14 mappers total. Most jobs use 7 per node (49 total).
On Thu, Sep 17, 2009 at 12:56 AM, Zheng Shao <[email protected]> wrote:
> You mean 14 mappers running concurrently, correct?
> How many mappers in total for the hive query?
>
> Zheng
>
> On Wed, Sep 16, 2009 at 6:50 AM, Brad Heintz <[email protected]> wrote:
>> There are 14 mappers spawned when I do a Hive query - over 7 nodes. Other
>> jobs spawn 7 mappers per node (total of 49), rather than 2.
>>
>> Block size is default.
>>
>> I'll try the "describe extended" as soon as I get a chance.
>>
>> Thanks,
>> - Brad
>>
>> On Tue, Sep 15, 2009 at 7:23 PM, Ashish Thusoo <[email protected]> wrote:
>>> Can't seem to make head or tail of this. How many mappers does the job
>>> spawn? The explain plan seems to be fine. Can you also do a
>>>
>>> describe extended
>>>
>>> on both the input and the output tables?
>>>
>>> Also, what is the block size, and how many HDFS nodes is this data
>>> spread over?
>>>
>>> Ashish
>>> ------------------------------
>>> From: Brad Heintz [mailto:[email protected]]
>>> Sent: Monday, September 14, 2009 1:23 PM
>>> To: [email protected]
>>> Subject: Re: Strange behavior during Hive queries
>>>
>>> 436 files, each about 2GB.
>>>
>>> On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <[email protected]> wrote:
>>>> Currently, Hive uses 1 mapper per file - does your table have lots of
>>>> small files? If yes, it might be a good idea to concatenate them into
>>>> fewer files.
>>>>
>>>> From: Ravi Jagannathan [mailto:[email protected]]
>>>> Sent: Monday, September 14, 2009 12:17 PM
>>>> To: Brad Heintz; [email protected]
>>>> Subject: RE: Strange behavior during Hive queries
>>>>
>>>> http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
>>>>
>>>> Related issue: Hive used too many mappers for a very small table.
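[Editor's note: Namit's "1 mapper per file" point can be sanity-checked with simple arithmetic. A rough sketch, using the file count and sizes reported in this thread; the 64 MB block size is an assumed old default, not a value from the thread:]

```python
# Rough estimate of map-task counts for the table described in the thread:
# 436 uncompressed text files of ~2 GB each.

GB = 1024 ** 3
num_files = 436
file_size = 2 * GB

# Under "one mapper per file" (no splitting or combining), the job gets
# one map task per file, regardless of file size.
mappers_per_file = num_files
print(mappers_per_file)  # 436

# If each file were instead split at HDFS block boundaries (assuming an
# old 64 MB default block size), the count would be far larger.
block_size = 64 * 1024 ** 2
mappers_per_block = num_files * -(-file_size // block_size)  # ceil division
print(mappers_per_block)  # 13952
```

Neither model predicts the 14 mappers Brad reports, which is part of why the thread finds the behavior puzzling.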
>>>> ------------------------------
>>>> From: Brad Heintz [mailto:[email protected]]
>>>> Sent: Monday, September 14, 2009 11:51 AM
>>>> To: [email protected]
>>>> Subject: Re: Strange behavior during Hive queries
>>>>
>>>> Ashish -
>>>>
>>>> mapred.min.split.size is set to 0 (according to the job.xml). The data
>>>> are stored as uncompressed text files.
>>>>
>>>> The plan is attached. I've been over it and didn't find anything useful,
>>>> but I'm also new to Hive and don't claim to understand everything I'm
>>>> looking at. If you have any insight, I'd be most grateful.
>>>>
>>>> Many thanks,
>>>> - Brad
>>>>
>>>> On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <[email protected]> wrote:
>>>>
>>>> How is your data stored - sequencefiles, textfiles, compressed? And
>>>> what is the value of mapred.min.split.size? Hive does not usually make
>>>> a decision on the number of mappers, but it does try to estimate the
>>>> number of reducers to use. Also, if you send out the plan, that would
>>>> be great.
>>>>
>>>> Ashish
>>>>
>>>> ------------------------------
>>>> From: Brad Heintz [mailto:[email protected]]
>>>> Sent: Sunday, September 13, 2009 9:36 AM
>>>> To: [email protected]
>>>> Subject: Re: Strange behavior during Hive queries
>>>>
>>>> Edward -
>>>>
>>>> Yeah, I figured Hive made some internal decisions about how many
>>>> mappers & reducers it used, but this is acting on almost 1TB of data -
>>>> I don't see why it would use fewer mappers. Also, this isn't a sort
>>>> (which would of course use only 1 reducer) - it's a straight count.
>>>>
>>>> Thanks,
>>>> - Brad
>>>>
>>>> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[email protected]> wrote:
>>>> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[email protected]> wrote:
>>>> > Hrm... sorry, I didn't read your original query closely enough.
>>>> >
>>>> > I'm not sure what could be causing this.
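[Editor's note: for context on Ashish's question about mapred.min.split.size, the way a minimum split size interacts with the block size in Hadoop's FileInputFormat can be sketched as follows. This uses the clamp rule max(minSize, min(maxSize, blockSize)); the specific values below are illustrative, not from the thread:]

```python
def split_size(min_size, max_size, block_size):
    """FileInputFormat's split-size rule: clamp the HDFS block size
    between the configured minimum and maximum split sizes."""
    return max(min_size, min(max_size, block_size))

MB = 1024 ** 2

# With mapred.min.split.size = 0 (as in Brad's job.xml) and no maximum,
# splits simply follow the HDFS block size.
print(split_size(0, float("inf"), 64 * MB) // MB)        # 64

# Raising the minimum split size above the block size yields fewer,
# larger splits - and therefore fewer mappers.
print(split_size(256 * MB, float("inf"), 64 * MB) // MB) # 256
```

So a min split size of 0 cannot be what reduces the mapper count here; it leaves splitting entirely to the block size.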
>>>> > The map.tasks.maximum parameter
>>>> > shouldn't affect it at all - it only affects the number of slots on
>>>> > the trackers.
>>>> >
>>>> > By any chance do you have mapred.max.maps.per.node set? This is a
>>>> > configuration parameter added by HADOOP-5170 - it's not in trunk or
>>>> > the vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3
>>>> > release this parameter could cause the behavior you're seeing.
>>>> > However, it would certainly not default to 2, so I'd be surprised if
>>>> > that were it.
>>>> >
>>>> > -Todd
>>>> >
>>>> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:
>>>> >>
>>>> >> Todd -
>>>> >>
>>>> >> Of course; it makes sense that it would be that way. But I'm still
>>>> >> left wondering why, then, my Hive queries are only using 2 mappers
>>>> >> per task tracker when other jobs use 7. I've gone so far as to diff
>>>> >> the job.xml files from a regular job and a Hive query, and didn't
>>>> >> turn up anything - though clearly, it has to be something Hive is
>>>> >> doing.
>>>> >>
>>>> >> Thanks,
>>>> >> - Brad
>>>> >>
>>>> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
>>>> >>>
>>>> >>> Hi Brad,
>>>> >>>
>>>> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>>> >>> TaskTracker when it starts up. It cannot be changed per-job.
>>>> >>>
>>>> >>> Hope that helps
>>>> >>> -Todd
>>>> >>>
>>>> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]> wrote:
>>>> >>>>
>>>> >>>> TIA if anyone can point me in the right direction on this.
>>>> >>>>
>>>> >>>> I'm running a simple Hive query (a count on an external table
>>>> >>>> comprising 436 files, each of ~2GB). The cluster's mapred-site.xml
>>>> >>>> specifies mapred.tasktracker.map.tasks.maximum = 7 - that is, 7
>>>> >>>> mappers per worker node.
>>>> >>>> When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I
>>>> >>>> see 7 mappers spawned on each worker.
>>>> >>>>
>>>> >>>> The problem: when I run my Hive query, I see 2 mappers spawned per
>>>> >>>> worker.
>>>> >>>>
>>>> >>>> When I do "set -v;" from the Hive command line, I see
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The job.xml for the Hive query also shows
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The only lead I have is that the default for
>>>> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
>>>> >>>> overridden in the cluster's mapred-site.xml, I've tried redundantly
>>>> >>>> overriding this variable everyplace I can think of (Hive command
>>>> >>>> line with "-hiveconf", using set from the Hive prompt, et al.) and
>>>> >>>> nothing works. I've combed the docs & mailing list, but haven't
>>>> >>>> run across the answer.
>>>> >>>>
>>>> >>>> Does anyone have any ideas what (if anything) I'm missing? Is this
>>>> >>>> some quirk of Hive, where it decides that 2 mappers per
>>>> >>>> tasktracker is enough, and I should just leave it alone? Or is
>>>> >>>> there some knob I can fiddle to get it to use my cluster at full
>>>> >>>> power?
>>>> >>>>
>>>> >>>> Many thanks in advance,
>>>> >>>> - Brad
>>>> >>>>
>>>> >>>> --
>>>> >>>> Brad Heintz
>>>> >>>> [email protected]
>>>>
>>>> Hive does adjust some map/reduce settings based on the job size. Some
>>>> tasks, like a sort, might only require one map/reduce to work as well.
>
> --
> Yours,
> Zheng

--
Brad Heintz
[email protected]
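[Editor's note: the numbers at the top of the thread are consistent with the Hive job simply not having enough map tasks to fill the available slots. A quick sketch of that arithmetic, using the node and slot counts from the thread:]

```python
import math

nodes = 7
slots_per_node = 7  # mapred.tasktracker.map.tasks.maximum on this cluster

def concurrent_per_node(total_map_tasks):
    # A tasktracker can't run more mappers than the job has tasks to give
    # it: with even scheduling, each node gets roughly total/nodes tasks,
    # capped by the configured slot count.
    return min(slots_per_node, math.ceil(total_map_tasks / nodes))

print(concurrent_per_node(14))  # 2 - the Hive query: 14 tasks over 7 nodes
print(concurrent_per_node(49))  # 7 - a typical job that fills every slot
```

In other words, "2 mappers per node" is a symptom of the query producing only 14 map tasks, not of any per-job slot limit.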
