No - 2 mappers per node, 7 nodes = 14 mappers total. Most jobs use 7 per node (49 total).
On Thu, Sep 17, 2009 at 12:56 AM, Zheng Shao <[email protected]> wrote:
> You mean 14 mappers running concurrently, correct?
> How many mappers in total for the hive query?
>
> Zheng
>
> On Wed, Sep 16, 2009 at 6:50 AM, Brad Heintz <[email protected]> wrote:
>> There are 14 mappers spawned when I do a Hive query - over 7 nodes. Other
>> jobs spawn 7 mappers per node (total of 49), rather than 2.
>>
>> Block size is default.
>>
>> I'll try the "describe extended" as soon as I get a chance.
>>
>> Thanks,
>> - Brad
>>
>> On Tue, Sep 15, 2009 at 7:23 PM, Ashish Thusoo <[email protected]> wrote:
>>> Can't seem to make head or tail of this. How many mappers does the job
>>> spawn? The explain plan seems to be fine. Can you also do a
>>>
>>> describe extended
>>>
>>> on both the input and the output tables?
>>>
>>> Also, what is the block size, and how many HDFS nodes is this data
>>> spread over?
>>>
>>> Ashish
>>> ------------------------------
>>> From: Brad Heintz [mailto:[email protected]]
>>> Sent: Monday, September 14, 2009 1:23 PM
>>> To: [email protected]
>>> Subject: Re: Strange behavior during Hive queries
>>>
>>> 436 files, each about 2GB.
>>>
>>> On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <[email protected]> wrote:
>>>> Currently, Hive uses 1 mapper per file - does your table have lots of
>>>> small files? If yes, it might be a good idea to concatenate them into
>>>> fewer files.
>>>>
>>>> From: Ravi Jagannathan [mailto:[email protected]]
>>>> Sent: Monday, September 14, 2009 12:17 PM
>>>> To: Brad Heintz; [email protected]
>>>> Subject: RE: Strange behavior during Hive queries
>>>>
>>>> http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
>>>>
>>>> Related issue: Hive used too many mappers for a very small table.
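[Editor's note: Namit's "1 mapper per file" point can be sanity-checked with simple arithmetic. A rough sketch, using the file count and sizes reported in this thread; the 64 MB block size is an assumed old default, not a value from the thread:]

```python
# Rough estimate of map-task counts for the table described in the thread:
# 436 uncompressed text files of ~2 GB each.

GB = 1024 ** 3
num_files = 436
file_size = 2 * GB

# Under "one mapper per file" (no splitting or combining), the job gets
# one map task per file, regardless of file size.
mappers_per_file = num_files
print(mappers_per_file)  # 436

# If each file were instead split at HDFS block boundaries (assuming an
# old 64 MB default block size), the count would be far larger.
block_size = 64 * 1024 ** 2
mappers_per_block = num_files * -(-file_size // block_size)  # ceil division
print(mappers_per_block)  # 13952
```

Neither model predicts the 14 mappers Brad reports, which is part of why the thread finds the behavior puzzling.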
>>>> ------------------------------
>>>> From: Brad Heintz [mailto:[email protected]]
>>>> Sent: Monday, September 14, 2009 11:51 AM
>>>> To: [email protected]
>>>> Subject: Re: Strange behavior during Hive queries
>>>>
>>>> Ashish -
>>>>
>>>> mapred.min.split.size is set to 0 (according to the job.xml). The data
>>>> are stored as uncompressed text files.
>>>>
>>>> The plan is attached. I've been over it and didn't find anything useful,
>>>> but I'm also new to Hive and don't claim to understand everything I'm
>>>> looking at. If you have any insight, I'd be most grateful.
>>>>
>>>> Many thanks,
>>>> - Brad
>>>>
>>>> On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <[email protected]> wrote:
>>>>
>>>> How is your data stored - sequencefiles, textfiles, compressed? And
>>>> what is the value of mapred.min.split.size? Hive does not usually make
>>>> a decision on the number of mappers, but it does try to estimate the
>>>> number of reducers to use. Also, if you send out the plan, that would
>>>> be great.
>>>>
>>>> Ashish
>>>>
>>>> ------------------------------
>>>> From: Brad Heintz [mailto:[email protected]]
>>>> Sent: Sunday, September 13, 2009 9:36 AM
>>>> To: [email protected]
>>>> Subject: Re: Strange behavior during Hive queries
>>>>
>>>> Edward -
>>>>
>>>> Yeah, I figured Hive made some internal decisions about how many
>>>> mappers & reducers it used, but this is acting on almost 1TB of data -
>>>> I don't see why it would use fewer mappers. Also, this isn't a sort
>>>> (which would of course use only 1 reducer) - it's a straight count.
>>>>
>>>> Thanks,
>>>> - Brad
>>>>
>>>> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[email protected]> wrote:
>>>> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[email protected]> wrote:
>>>> > Hrm... sorry, I didn't read your original query closely enough.
>>>> >
>>>> > I'm not sure what could be causing this.
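[Editor's note: for context on Ashish's question about mapred.min.split.size, the way a minimum split size interacts with the block size in Hadoop's FileInputFormat can be sketched as follows. This uses the clamp rule max(minSize, min(maxSize, blockSize)); the specific values below are illustrative, not from the thread:]

```python
def split_size(min_size, max_size, block_size):
    """FileInputFormat's split-size rule: clamp the HDFS block size
    between the configured minimum and maximum split sizes."""
    return max(min_size, min(max_size, block_size))

MB = 1024 ** 2

# With mapred.min.split.size = 0 (as in Brad's job.xml) and no maximum,
# splits simply follow the HDFS block size.
print(split_size(0, float("inf"), 64 * MB) // MB)        # 64

# Raising the minimum split size above the block size yields fewer,
# larger splits - and therefore fewer mappers.
print(split_size(256 * MB, float("inf"), 64 * MB) // MB) # 256
```

So a min split size of 0 cannot be what reduces the mapper count here; it leaves splitting entirely to the block size.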
>>>> > The map.tasks.maximum parameter
>>>> > shouldn't affect it at all - it only affects the number of slots on
>>>> > the trackers.
>>>> >
>>>> > By any chance do you have mapred.max.maps.per.node set? This is a
>>>> > configuration parameter added by HADOOP-5170 - it's not in trunk or
>>>> > the vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3
>>>> > release this parameter could cause the behavior you're seeing.
>>>> > However, it would certainly not default to 2, so I'd be surprised if
>>>> > that were it.
>>>> >
>>>> > -Todd
>>>> >
>>>> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:
>>>> >>
>>>> >> Todd -
>>>> >>
>>>> >> Of course; it makes sense that it would be that way. But I'm still
>>>> >> left wondering why, then, my Hive queries are only using 2 mappers
>>>> >> per task tracker when other jobs use 7. I've gone so far as to diff
>>>> >> the job.xml files from a regular job and a Hive query, and didn't
>>>> >> turn up anything - though clearly, it has to be something Hive is
>>>> >> doing.
>>>> >>
>>>> >> Thanks,
>>>> >> - Brad
>>>> >>
>>>> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
>>>> >>>
>>>> >>> Hi Brad,
>>>> >>>
>>>> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>>> >>> TaskTracker when it starts up. It cannot be changed per-job.
>>>> >>>
>>>> >>> Hope that helps
>>>> >>> -Todd
>>>> >>>
>>>> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]> wrote:
>>>> >>>>
>>>> >>>> TIA if anyone can point me in the right direction on this.
>>>> >>>>
>>>> >>>> I'm running a simple Hive query (a count on an external table
>>>> >>>> comprising 436 files, each of ~2GB). The cluster's mapred-site.xml
>>>> >>>> specifies mapred.tasktracker.map.tasks.maximum = 7 - that is, 7
>>>> >>>> mappers per worker node.
>>>> >>>> When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I
>>>> >>>> see 7 mappers spawned on each worker.
>>>> >>>>
>>>> >>>> The problem: when I run my Hive query, I see 2 mappers spawned per
>>>> >>>> worker.
>>>> >>>>
>>>> >>>> When I do "set -v;" from the Hive command line, I see
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The job.xml for the Hive query also shows
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The only lead I have is that the default for
>>>> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
>>>> >>>> overridden in the cluster's mapred-site.xml, I've tried redundantly
>>>> >>>> overriding this variable everyplace I can think of (Hive command
>>>> >>>> line with "-hiveconf", using set from the Hive prompt, et al.) and
>>>> >>>> nothing works. I've combed the docs & mailing list, but haven't
>>>> >>>> run across the answer.
>>>> >>>>
>>>> >>>> Does anyone have any ideas what (if anything) I'm missing? Is this
>>>> >>>> some quirk of Hive, where it decides that 2 mappers per
>>>> >>>> tasktracker is enough, and I should just leave it alone? Or is
>>>> >>>> there some knob I can fiddle to get it to use my cluster at full
>>>> >>>> power?
>>>> >>>>
>>>> >>>> Many thanks in advance,
>>>> >>>> - Brad
>>>> >>>>
>>>> >>>> --
>>>> >>>> Brad Heintz
>>>> >>>> [email protected]
>>>>
>>>> Hive does adjust some map/reduce settings based on the job size. Some
>>>> tasks, like a sort, might only require one map/reduce to work as well.
>
> --
> Yours,
> Zheng

--
Brad Heintz
[email protected]
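[Editor's note: the numbers at the top of the thread are consistent with the Hive job simply not having enough map tasks to fill the available slots. A quick sketch of that arithmetic, using the node and slot counts from the thread:]

```python
import math

nodes = 7
slots_per_node = 7  # mapred.tasktracker.map.tasks.maximum on this cluster

def concurrent_per_node(total_map_tasks):
    # A tasktracker can't run more mappers than the job has tasks to give
    # it: with even scheduling, each node gets roughly total/nodes tasks,
    # capped by the configured slot count.
    return min(slots_per_node, math.ceil(total_map_tasks / nodes))

print(concurrent_per_node(14))  # 2 - the Hive query: 14 tasks over 7 nodes
print(concurrent_per_node(49))  # 7 - a typical job that fills every slot
```

In other words, "2 mappers per node" is a symptom of the query producing only 14 map tasks, not of any per-job slot limit.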
