Hi Bikas, Could you please specify the property to turn on that feature in Tez ? About the Hive compiler feature, could you elaborate a bit more ? Do you mean the *SetReducerParallelism *class ? In one case, the number of reducers of Hive-over-MR is exactly *(number of bytes read from HDFS / hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how SetPeducerParallelism does it.
Thanks, Manu On Fri, Jan 24, 2014 at 1:09 PM, Bikas Saha <bi...@hortonworks.com> wrote: > This feature is turned off by default in Tez. So unless you have turned it > on (or hive turns it on), the number of reducers is probably different > because latest Hive compiler tries to determine the number of reducers via > table statistics at compile time. It may be that this Hive compiler > feature is on by default for Hive-on-Tez compilation but off by default > for Hive-on-MR compilation. > > Bikas > > -----Original Message----- > From: Hitesh Shah [mailto:hit...@apache.org] > Sent: Thursday, January 23, 2014 8:27 PM > To: dev@tez.incubator.apache.org > Subject: Re: How to decide number of partitions in a Map Output > > Yes. It looks at the size of outputs generated by the tasks in the > previous vertex and determines the no. of tasks to run based on the > configured amount of input data size per task. > > -- Hitesh > > On Jan 23, 2014, at 7:59 PM, Manu Zhang wrote: > > > So that's what *ShuffleVertexManager.determineParallelismAndApply* does > ? > > > > > > On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan > <gop...@apache.org>wrote: > > > >> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang <owenzhang1...@gmail.com> > >> wrote: > >> > >>> Another question is that, when running Hive over Tez, why is the > >>> number > >> of > >>> reducers not the same as that of Hive over MR, provided with the > >>> same > >> input > >>> data and configurations ? > >> > >> This is too big a note to write out into an email > >> > >> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/ > >> > >> But in short, unlike MR, Tez enables you to set reducer count > dynamically. > >> > >> Cheers, > >> Gopal > >> > >>> On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha <bi...@hortonworks.com> > >> wrote: > >>> > >>>> It would help us understand your situation if you could give a > >>>> short description of how you changes are going to speed up Map > >>>> output. If it > >> is > >>>> generally useful then we could consider adding it the existing > >>>> library > >> of > >>>> inputs and outputs. > >>>> > >>>> The number of physical outputs == the number of downstream > >>>> consumers of that partitioned data. Think of them as the number of > >>>> reducers. So if there are N reducers (and thus you will be > >>>> partitioning the data N ways) then number of physical output == N. > >>>> > >>>> The MRPartitioner config item (tez.runtime.num.expected.partitions) > >>>> is used to communicate the above information to the MRPartitioner. > >>>> In the above example you would set it to N so that the > >>>> MRPartitioner would partition the data N ways. > >>>> > >>>> Hive uses internal statistics to calculate the expected number of > >>>> partitions at compile time. However once Hive determines the number > >>>> of tasks (say reducers) then the partitioner will always get that > >>>> value for the number of partitions to create. > >>>> > >>>> Not sure what you mean by partitionId will exceed that number when > >> running > >>>> some jobs. Can you please elaborate? Do you mean that the > >>>> partitioner.getPartition(Key, Value, Partition) is getting a value > >>>> for Partition > num physical outputs? In that case, please check > >>>> your customized output code because that should be the one calling > >>>> the > >>>> getPartition() method. > >>>> > >>>> Bikas > >>>> > >>>> -----Original Message----- > >>>> From: Manu Zhang [mailto:owenzhang1...@gmail.com] > >>>> Sent: Monday, January 20, 2014 6:10 PM > >>>> To: dev@tez.incubator.apache.org > >>>> Subject: How to decide number of partitions in a Map Output > >>>> > >>>> Hi all, > >>>> > >>>> I've been working on a customized Output which works like > >>>> OnFileSortedOutput but with optimizations that will speed up Map > output. > >>>> > >>>> The issue is about the *number of partitions*. My current > >>>> implementation is set it to number of physicalOutputs but the > >>>> *partitionId will exceed that > >>>> number* when runnning some jobs. > >>>> > >>>> After referring to MRPartitioner, I found the number of partition > >>>> is > >> set > >>>> to "tez.runtime.num.expected.partitions" (or 1 if null) . So what > >>>> is the difference between that property and physicalOutputs ? > >>>> > >>>> Also , when running Hive queries over Tez (with my customized > >>>> output), a Hive property "hive.exec.reducers.bytes.per.reducer" > >>>> could also alter > >> the > >>>> number of partitions, according to my observation. > >>>> > >>>> Any ideas ? > >>>> Thanks > >>>> > >>>> Manu Zhang > >>>> > >>>> -- > >>>> CONFIDENTIALITY NOTICE > >>>> NOTICE: This message is intended for the use of the individual or > >> entity to > >>>> which it is addressed and may contain information that is > >>>> confidential, privileged and exempt from disclosure under > >>>> applicable law. If the > >> reader > >>>> of this message is not the intended recipient, you are hereby > >>>> notified > >> that > >>>> any printing, copying, dissemination, distribution, disclosure or > >>>> forwarding of this communication is strictly prohibited. If you > >>>> have received this communication in error, please contact the > >>>> sender > >> immediately > >>>> and delete it from your system. Thank You. > >>>> > >> > >> -- > >> CONFIDENTIALITY NOTICE > >> NOTICE: This message is intended for the use of the individual or > >> entity to which it is addressed and may contain information that is > >> confidential, privileged and exempt from disclosure under applicable > >> law. If the reader of this message is not the intended recipient, you > >> are hereby notified that any printing, copying, dissemination, > >> distribution, disclosure or forwarding of this communication is > >> strictly prohibited. If you have received this communication in > >> error, please contact the sender immediately and delete it from your > system. Thank You. > >> > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You. >