436 files, each about 2GB.
On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <[email protected]> wrote:
> Currently, hive uses 1 mapper per file – does your table have lots of
> small files? If yes, it might be a good idea to concatenate them into
> fewer files.
>
> *From:* Ravi Jagannathan [mailto:[email protected]]
> *Sent:* Monday, September 14, 2009 12:17 PM
> *To:* Brad Heintz; [email protected]
> *Subject:* RE: Strange behavior during Hive queries
>
> http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
>
> Related issue: hive used too many mappers for a very small table.
>
> ------------------------------
>
> *From:* Brad Heintz [mailto:[email protected]]
> *Sent:* Monday, September 14, 2009 11:51 AM
> *To:* [email protected]
> *Subject:* Re: Strange behavior during Hive queries
>
> Ashish -
>
> mapred.min.split.size is set to 0 (according to the job.xml). The data
> are stored as uncompressed text files.
>
> Plan is attached. I've been over it and didn't find anything useful,
> but I'm also new to Hive and don't claim to understand everything I'm
> looking at. If you have any insight, I'd be most grateful.
>
> Many thanks,
> - Brad
>
> On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <[email protected]> wrote:
>
> How is your data stored - sequencefiles, textfiles, compressed? And
> what is the value of mapred.min.split.size? Hive does not usually make
> a decision on the number of mappers, but it does try to make an
> estimate of the number of reducers to use. Also, if you send out the
> plan that would be great.
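A sketch of what Namit's concatenation suggestion could look like in HiveQL. The table name `raw_events` and the merge setting are illustrative only (the thread never names the table, and the hive.merge.* options may not exist in every Hive build of that era); rewriting the table through a single query lets Hive emit fewer, larger files:

```sql
-- Hypothetical table name; substitute the real external table.
CREATE TABLE raw_events_merged LIKE raw_events;

-- If available in your Hive build, ask Hive to merge small
-- map-output files when writing the result.
SET hive.merge.mapfiles=true;

INSERT OVERWRITE TABLE raw_events_merged
SELECT * FROM raw_events;
```

For a table of 436 files at ~2GB each this is less relevant (those files are not small), but it addresses the too-many-mappers case Ravi's link describes.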
>
> Ashish
>
> ------------------------------
>
> *From:* Brad Heintz [mailto:[email protected]]
> *Sent:* Sunday, September 13, 2009 9:36 AM
> *To:* [email protected]
> *Subject:* Re: Strange behavior during Hive queries
>
> Edward -
>
> Yeah, I figured Hive had some decisions it made internally about how
> many mappers & reducers it used, but this is acting on almost 1TB of
> data - I don't see why it would use fewer mappers. Also, this isn't a
> sort (which would of course use only 1 reducer) - it's a straight
> count.
>
> Thanks,
> - Brad
>
> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[email protected]> wrote:
>
> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[email protected]> wrote:
> > Hrm... sorry, I didn't read your original query closely enough.
> >
> > I'm not sure what could be causing this. The map.tasks.maximum
> > parameter shouldn't affect it at all - it only affects the number of
> > slots on the trackers.
> >
> > By any chance do you have mapred.max.maps.per.node set? This is a
> > configuration parameter added by HADOOP-5170 - it's not in trunk or
> > the vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3
> > release this parameter could cause the behavior you're seeing.
> > However, it would certainly not default to 2, so I'd be surprised if
> > that were it.
> >
> > -Todd
> >
> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[email protected]> wrote:
> >>
> >> Todd -
> >>
> >> Of course; it makes sense that it would be that way. But I'm still
> >> left wondering why, then, my Hive queries are only using 2 mappers
> >> per task tracker when other jobs use 7. I've gone so far as to diff
> >> the job.xml files from a regular job and a Hive query, and didn't
> >> turn up anything - though clearly, it has to be something Hive is
> >> doing.
> >>
> >> Thanks,
> >> - Brad
> >>
> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[email protected]> wrote:
> >>>
> >>> Hi Brad,
> >>>
> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
> >>> TaskTracker when it starts up. It cannot be changed per-job.
> >>>
> >>> Hope that helps
> >>> -Todd
> >>>
> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[email protected]> wrote:
> >>>>
> >>>> TIA if anyone can point me in the right direction on this.
> >>>>
> >>>> I'm running a simple Hive query (a count on an external table
> >>>> comprising 436 files, each of ~2GB). The cluster's
> >>>> mapred-site.xml specifies mapred.tasktracker.map.tasks.maximum =
> >>>> 7 - that is, 7 mappers per worker node. When I run regular MR
> >>>> jobs via "bin/hadoop jar myJob.jar...", I see 7 mappers spawned
> >>>> on each worker.
> >>>>
> >>>> The problem: When I run my Hive query, I see 2 mappers spawned
> >>>> per worker.
> >>>>
> >>>> When I do "set -v;" from the Hive command line, I see
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The job.xml for the Hive query shows
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The only lead I have is that the default for
> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
> >>>> overridden in the cluster's mapred-site.xml, I've tried
> >>>> redundantly overriding this variable everyplace I can think of
> >>>> (Hive command line with "-hiveconf", using set from the Hive
> >>>> prompt, et al) and nothing works. I've combed the docs & mailing
> >>>> list, but haven't run across the answer.
> >>>>
> >>>> Does anyone have any ideas what (if anything) I'm missing? Is
> >>>> this some quirk of Hive, where it decides that 2 mappers per
> >>>> tasktracker is enough, and I should just leave it alone? Or is
> >>>> there some knob I can fiddle to get it to use my cluster at full
> >>>> power?
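Todd's point is the key constraint here: the slot count is a daemon-side setting, which is why none of the per-job overrides (job.xml, -hiveconf, `set` at the Hive prompt) can change it. As an illustrative config fragment (file layout assumes a stock Hadoop 0.18.x install), the parameter lives in each worker's mapred-site.xml and is read only once, when the TaskTracker starts:

```xml
<!-- conf/mapred-site.xml on every worker node. This is read at
     TaskTracker startup; changing it requires restarting the
     tasktracker daemons (e.g. via bin/hadoop-daemon.sh), and it
     cannot be overridden per-job or from the Hive CLI. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
```

In Brad's case the value is already 7 and honored by plain MapReduce jobs, so this setting is ruled out as the cause, which is what pushes the thread toward Hive-side input-split behavior instead.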
> >>>>
> >>>> Many thanks in advance,
> >>>> - Brad
> >>>>
> >>>> --
> >>>> Brad Heintz
> >>>> [email protected]
>
> Hive does adjust some map/reduce settings based on the job size. Some
> tasks, like a sort, might only require one map/reduce to work as well.

--
Brad Heintz
[email protected]
