Thanks, this is extremely helpful.
On Sat, Jan 25, 2014 at 1:45 AM, Bikas Saha <bi...@hortonworks.com> wrote:

> Set tez.am.shuffle-vertex-manager.enable.auto-parallel=true for tez
> auto-reduce.
>
> >> About the Hive compiler feature, could you elaborate a bit more? Do
> >> you mean the *SetReducerParallelism* class? In one case, the number of
> >> reducers of Hive-over-MR is exactly *(number of bytes read from HDFS /
> >> hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how
> >> SetReducerParallelism does it.
>
> + Gunther for the hive question. In case Gunther does not respond please
> email the hive user/dev lists with your question.
>
> Bikas
>
> -----Original Message-----
> From: Manu Zhang [mailto:owenzhang1...@gmail.com]
> Sent: Friday, January 24, 2014 2:22 AM
> To: dev@tez.incubator.apache.org
> Subject: Re: How to decide number of partitions in a Map Output
>
> Hi Bikas,
>
> Could you please specify the property to turn on that feature in Tez?
> About the Hive compiler feature, could you elaborate a bit more? Do you
> mean the *SetReducerParallelism* class? In one case, the number of
> reducers of Hive-over-MR is exactly *(number of bytes read from HDFS /
> hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how
> SetReducerParallelism does it.
>
> Thanks,
> Manu
>
>
> On Fri, Jan 24, 2014 at 1:09 PM, Bikas Saha <bi...@hortonworks.com> wrote:
>
> > This feature is turned off by default in Tez. So unless you have
> > turned it on (or hive turns it on), the number of reducers is probably
> > different because latest Hive compiler tries to determine the number
> > of reducers via table statistics at compile time. It may be that this
> > Hive compiler feature is on by default for Hive-on-Tez compilation but
> > off by default for Hive-on-MR compilation.
> >
> > Bikas
> >
> > -----Original Message-----
> > From: Hitesh Shah [mailto:hit...@apache.org]
> > Sent: Thursday, January 23, 2014 8:27 PM
> > To: dev@tez.incubator.apache.org
> > Subject: Re: How to decide number of partitions in a Map Output
> >
> > Yes. It looks at the size of outputs generated by the tasks in the
> > previous vertex and determines the no. of tasks to run based on the
> > configured amount of input data size per task.
> >
> > -- Hitesh
> >
> > On Jan 23, 2014, at 7:59 PM, Manu Zhang wrote:
> >
> > > So that's what *ShuffleVertexManager.determineParallelismAndApply*
> > > does?
> > >
> > >
> > > On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan
> > > <gop...@apache.org> wrote:
> > >
> > >> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang
> > >> <owenzhang1...@gmail.com> wrote:
> > >>
> > >>> Another question is that, when running Hive over Tez, why is the
> > >>> number of reducers not the same as that of Hive over MR, provided
> > >>> with the same input data and configurations?
> > >>
> > >> This is too big a note to write out into an email
> > >>
> > >> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
> > >>
> > >> But in short, unlike MR, Tez enables you to set reducer count
> > >> dynamically.
> > >>
> > >> Cheers,
> > >> Gopal
> > >>
> > >>> On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha
> > >>> <bi...@hortonworks.com> wrote:
> > >>>
> > >>>> It would help us understand your situation if you could give a
> > >>>> short description of how your changes are going to speed up Map
> > >>>> output. If it is generally useful then we could consider adding
> > >>>> it to the existing library of inputs and outputs.
> > >>>>
> > >>>> The number of physical outputs == the number of downstream
> > >>>> consumers of that partitioned data. Think of them as the number
> > >>>> of reducers.
> > >>>> So if there are N reducers (and thus you will be
> > >>>> partitioning the data N ways) then number of physical outputs == N.
> > >>>>
> > >>>> The MRPartitioner config item
> > >>>> (tez.runtime.num.expected.partitions)
> > >>>> is used to communicate the above information to the MRPartitioner.
> > >>>> In the above example you would set it to N so that the
> > >>>> MRPartitioner would partition the data N ways.
> > >>>>
> > >>>> Hive uses internal statistics to calculate the expected number of
> > >>>> partitions at compile time. However once Hive determines the
> > >>>> number of tasks (say reducers) then the partitioner will always
> > >>>> get that value for the number of partitions to create.
> > >>>>
> > >>>> Not sure what you mean by partitionId will exceed that number
> > >>>> when running some jobs. Can you please elaborate? Do you mean
> > >>>> that the partitioner.getPartition(Key, Value, Partition) is
> > >>>> getting a value for Partition > num physical outputs? In that
> > >>>> case, please check your customized output code because that
> > >>>> should be the one calling the getPartition() method.
> > >>>>
> > >>>> Bikas
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Manu Zhang [mailto:owenzhang1...@gmail.com]
> > >>>> Sent: Monday, January 20, 2014 6:10 PM
> > >>>> To: dev@tez.incubator.apache.org
> > >>>> Subject: How to decide number of partitions in a Map Output
> > >>>>
> > >>>> Hi all,
> > >>>>
> > >>>> I've been working on a customized Output which works like
> > >>>> OnFileSortedOutput but with optimizations that will speed up Map
> > >>>> output.
> > >>>>
> > >>>> The issue is about the *number of partitions*. My current
> > >>>> implementation sets it to the number of physicalOutputs but the
> > >>>> *partitionId will exceed that number* when running some jobs.
> > >>>>
> > >>>> After referring to MRPartitioner, I found the number of
> > >>>> partitions is set to "tez.runtime.num.expected.partitions"
> > >>>> (or 1 if null). So what is the difference between that property
> > >>>> and physicalOutputs?
> > >>>>
> > >>>> Also, when running Hive queries over Tez (with my customized
> > >>>> output), a Hive property "hive.exec.reducers.bytes.per.reducer"
> > >>>> could also alter the number of partitions, according to my
> > >>>> observation.
> > >>>>
> > >>>> Any ideas?
> > >>>> Thanks
> > >>>>
> > >>>> Manu Zhang
> > >>>>
> > >>>> --
> > >>>> CONFIDENTIALITY NOTICE
> > >>>> NOTICE: This message is intended for the use of the individual or
> > >>>> entity to which it is addressed and may contain information that
> > >>>> is confidential, privileged and exempt from disclosure under
> > >>>> applicable law. If the reader of this message is not the intended
> > >>>> recipient, you are hereby notified that any printing, copying,
> > >>>> dissemination, distribution, disclosure or forwarding of this
> > >>>> communication is strictly prohibited. If you have received this
> > >>>> communication in error, please contact the sender immediately
> > >>>> and delete it from your system. Thank You.
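
For readers skimming the thread, the sizing arithmetic discussed above can be sketched in a few lines. This is only an illustration under stated assumptions, not Tez or Hive code: the function names (`estimate_task_count`, `auto_reduce`, `partition_for`) are hypothetical, and the round-up, the floor of one task, the "only ever shrink" rule, and the hash-modulo partitioning are assumptions chosen for the example rather than behaviour confirmed in the messages.

```python
def estimate_task_count(total_bytes, bytes_per_task):
    """Estimate a task count as total input size over per-task size.

    Mirrors the (bytes read / bytes.per.reducer) calculation described in
    the thread. Rounding up and the minimum of 1 are assumptions.
    """
    if bytes_per_task <= 0:
        raise ValueError("bytes_per_task must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (total_bytes + bytes_per_task - 1) // bytes_per_task)


def auto_reduce(current_parallelism, upstream_output_bytes, desired_bytes_per_task):
    """Shrink a vertex's parallelism based on observed upstream output sizes.

    Hypothetical stand-in for the auto-reduce idea: sum the output sizes of
    the previous vertex's tasks, size the consumer stage from that, and
    never grow beyond the parallelism fixed at compile time (assumption).
    """
    wanted = estimate_task_count(sum(upstream_output_bytes), desired_bytes_per_task)
    return min(current_parallelism, wanted)


def partition_for(key, num_partitions):
    """Map a key to a partition id in the range [0, num_partitions).

    With N consumers the data is partitioned N ways, so a valid id can
    never reach N. Hash-modulo is an assumed partitioning scheme.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Python's % with a positive modulus is never negative.
    return hash(key) % num_partitions


if __name__ == "__main__":
    print(estimate_task_count(1_000_000_000, 256_000_000))
    print(auto_reduce(10, [100, 200, 300], 250))
    print(partition_for("some-key", 4))
```

If a custom output ever sees a partition id at or above the number of physical outputs, the two counts were configured from different sources; in a sketch like this one, both would come from the same `num_partitions` value.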