Re: Strange behavior during Hive queries

Brad Heintz Wed, 16 Sep 2009 06:51:10 -0700

There are 14 mappers spawned when I do a Hive query, spread over 7 nodes
(2 per node).  Other jobs spawn 7 mappers per node (total of 49), rather
than 2.
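
As a quick check of the counts above, here is a minimal Python sketch of the arithmetic. The function name is illustrative only; the figures (7 nodes, 2 versus 7 mappers per node) are the ones reported in this message.

```python
# Concurrent mappers = worker nodes x mappers running on each node.
NODES = 7  # worker nodes, as reported in this message


def concurrent_mappers(nodes: int, mappers_per_node: int) -> int:
    """Total number of mappers running at once across the cluster."""
    return nodes * mappers_per_node


hive_total = concurrent_mappers(NODES, 2)     # observed for Hive queries
regular_total = concurrent_mappers(NODES, 7)  # observed for other jobs

print(hive_total)     # 14
print(regular_total)  # 49
```

In other words, the Hive query is running on 14 of the 49 map slots the cluster offers.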


Block size is default.
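
To give a feel for what the reduced concurrency costs, here is a rough Python estimate. It rests on two assumptions that are not stated in the thread: that the default block size is 64 MB, and that each file yields one map task per block-sized split. If Hive really runs one mapper per file, as suggested in the quoted messages, the task count would be 436 instead.

```python
# Assumptions (NOT from the thread): 64 MB blocks, one map task per block.
BLOCK_SIZE = 64 * 1024**2
NUM_FILES = 436            # from the thread
FILE_SIZE = 2 * 1024**3    # "about 2GB" per file, from the thread


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def estimate_map_tasks(num_files: int, file_size: int, split_size: int) -> int:
    """Approximate map task count if every file splits on block boundaries."""
    return num_files * ceil_div(file_size, split_size)


def waves(map_tasks: int, concurrent_mappers: int) -> int:
    """Number of rounds needed to run all map tasks at a given concurrency."""
    return ceil_div(map_tasks, concurrent_mappers)


tasks = estimate_map_tasks(NUM_FILES, FILE_SIZE, BLOCK_SIZE)
print(tasks)             # 13952
print(waves(tasks, 14))  # 997 rounds at 2 mappers per node
print(waves(tasks, 49))  # 285 rounds at 7 mappers per node
```

Under these assumptions the query needs about three and a half times as many rounds of map tasks as it would with all 49 slots in use.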

I'll try the "describe extended" as soon as I get a chance.
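
For anyone wanting to repeat the job.xml comparison described in the quoted messages below, the following is a minimal Python sketch of a property-level diff. It assumes the usual Hadoop configuration layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>` children). The two sample documents are made up for illustration, and `example.hypothetical.setting` is not a real Hadoop or Hive property.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text: str) -> dict:
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for properties that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up samples, NOT the real job.xml files from this thread.
REGULAR = """<configuration>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>7</value></property>
  <property><name>mapred.min.split.size</name><value>0</value></property>
</configuration>"""

HIVE = """<configuration>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>7</value></property>
  <property><name>mapred.min.split.size</name><value>0</value></property>
  <property><name>example.hypothetical.setting</name><value>2</value></property>
</configuration>"""

changes = diff_properties(load_properties(REGULAR), load_properties(HIVE))
for name, (regular, hive) in changes.items():
    print(f"{name}: regular={regular!r} hive={hive!r}")
```

With real files, `ET.parse(path).getroot()` could replace `ET.fromstring`. A diff keyed on property names ignores ordering and whitespace, which a plain textual diff of the two files would not.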

Thanks,
- Brad

On Tue, Sep 15, 2009 at 7:23 PM, Ashish Thusoo <[email protected]> wrote:

>  Can't seem to make head or tail of this. How many mappers does the job
> spawn? The explain plan seems to be fine. Can you also do a
>
> describe extended
>
> on both the input and the output table.
>
> Also what is the block size and how many hdfs nodes is this data spread
> over.
>
> Ashish
>  ------------------------------
> *From:* Brad Heintz [mailto:[email protected]]
> *Sent:* Monday, September 14, 2009 1:23 PM
>
> *To:* [email protected]
> *Subject:* Re: Strange behavior during Hive queries
>
> 436 files, each about 2GB.
>
>
> On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <[email protected]> wrote:
>
>>  Currently, Hive uses 1 mapper per file – does your table have lots of
>> small files? If yes, it might be a good idea to concatenate them into fewer
>> files.
>>
>>
>>
>>
>>
>> *From:* Ravi Jagannathan [mailto:[email protected]]
>> *Sent:* Monday, September 14, 2009 12:17 PM
>> *To:* Brad Heintz; [email protected]
>> *Subject:* RE: Strange behavior during Hive queries
>>
>>
>>
>>
>> http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
>>
>> Related issue: Hive used too many mappers for a very small table.
>>
>>
>>  ------------------------------
>>
>> *From:* Brad Heintz [mailto:[email protected]]
>> *Sent:* Monday, September 14, 2009 11:51 AM
>> *To:* [email protected]
>> *Subject:* Re: Strange behavior during Hive queries
>>
>>
>>
>> Ashish -
>>
>> mapred.min.split.size is set to 0 (according to the job.xml).  The data
>> are stored as uncompressed text files.
>>
>> Plan is attached.  I've been over it and didn't find anything useful, but
>> I'm also new to Hive and don't claim to understand everything I'm looking
>> at.  If you have any insight, I'd be most grateful.
>>
>> Many thanks,
>> - Brad
>>
>> On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <[email protected]>
>> wrote:
>>
>> How is your data stored - sequencefiles, textfiles, compressed? And what
>> is the value of mapred.min.split.size? Hive does not usually make a
>> decision on the number of mappers but it does try to make an estimate of the
>> number of reducers to use. Also if you send out the plan that would be
>> great.
>>
>>
>>
>> Ashish
>>
>>
>>  ------------------------------
>>
>> *From:* Brad Heintz [mailto:[email protected]]
>> *Sent:* Sunday, September 13, 2009 9:36 AM
>> *To:* [email protected]
>> *Subject:* Re: Strange behavior during Hive queries
>>
>> Edward -
>>
>> Yeah, I figured Hive had some decisions it made internally about how many
>> mappers & reducers it used, but this is acting on almost 1TB of data - I
>> don't see why it would use fewer mappers.  Also, this isn't a sort (which
>> would of course use only 1 reducer) - it's a straight count.
>>
>> Thanks,
>> - Brad
>>
>> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[email protected]>
>> wrote:
>>
>> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[email protected]> wrote:
>> > Hrm... sorry, I didn't read your original query closely enough.
>> >
>> > I'm not sure what could be causing this. The map.tasks.maximum parameter
>> > shouldn't affect it at all - it only affects the number of slots on the
>> > trackers.
>> >
>> > By any chance do you have mapred.max.maps.per.node set? This is a
>> > configuration parameter added by HADOOP-5170 - it's not in trunk or the
>> > vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release
>> > this parameter could cause the behavior you're seeing. However, it would
>> > certainly not default to 2, so I'd be surprised if that were it.
>> >
>> > -Todd
>> >
>> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:
>> >>
>> >> Todd -
>> >>
>> >> Of course; it makes sense that it would be that way.  But I'm still left
>> >> wondering why, then, my Hive queries are only using 2 mappers per task
>> >> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> >> files from a regular job and a Hive query, and didn't turn up anything -
>> >> though clearly, it has to be something Hive is doing.
>> >>
>> >> Thanks,
>> >> - Brad
>> >>
>> >>
>> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
>> >>>
>> >>> Hi Brad,
>> >>>
>> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>> >>> TaskTracker when it starts up. It cannot be changed per-job.
>> >>>
>> >>> Hope that helps
>> >>> -Todd
>> >>>
>> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> TIA if anyone can point me in the right direction on this.
>> >>>>
>> >>>> I'm running a simple Hive query (a count on an external table comprising
>> >>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>> >>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>> >>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>> >>>> mappers spawned on each worker.
>> >>>>
>> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>> >>>> worker.
>> >>>>
>> >>>> When I do "set -v;" from the Hive command line, I see
>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>> >>>>
>> >>>> The job.xml for the Hive query shows
>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>> >>>>
>> >>>> The only lead I have is that the default for
>> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>> >>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>> >>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>> >>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>> >>>> docs & mailing list, but haven't run across the answer.
>> >>>>
>> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>> >>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>> >>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>> >>>> it to use my cluster at full power?
>> >>>>
>> >>>> Many thanks in advance,
>> >>>> - Brad
>> >>>>
>> >>>> --
>> >>>> Brad Heintz
>> >>>> [email protected]
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Brad Heintz
>> >> [email protected]
>> >
>> >
>>
>> Hive does adjust some map/reduce settings based on the job size. Some
>> tasks like a sort might only require one map/reduce to work as well.
>>
>>
>>
>>
>> --
>> Brad Heintz
>> [email protected]
>>
>>
>>
>>
>> --
>> Brad Heintz
>> [email protected]
>>
>
>
>
> --
> Brad Heintz
> [email protected]
>



-- 
Brad Heintz
[email protected]
