So that's what *ShuffleVertexManager.determineParallelismAndApply* does ?
On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan <gop...@apache.org>wrote: > On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang <owenzhang1...@gmail.com> > wrote: > > > Another question is that, when running Hive over Tez, why is the number > of > > reducers not the same as that of Hive over MR, provided with the same > input > > data and configurations ? > > This is too big a note to write out into an email > > http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/ > > But in short, unlike MR, Tez enables you to set reducer count dynamically. > > Cheers, > Gopal > > > On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha <bi...@hortonworks.com> > wrote: > > > >> It would help us understand your situation if you could give a short > >> description of how you changes are going to speed up Map output. If it > is > >> generally useful then we could consider adding it the existing library > of > >> inputs and outputs. > >> > >> The number of physical outputs == the number of downstream consumers of > >> that partitioned data. Think of them as the number of reducers. So if > >> there are N reducers (and thus you will be partitioning the data N ways) > >> then number of physical output == N. > >> > >> The MRPartitioner config item (tez.runtime.num.expected.partitions) is > >> used to communicate the above information to the MRPartitioner. In the > >> above example you would set it to N so that the MRPartitioner would > >> partition the data N ways. > >> > >> Hive uses internal statistics to calculate the expected number of > >> partitions at compile time. However once Hive determines the number of > >> tasks (say reducers) then the partitioner will always get that value for > >> the number of partitions to create. > >> > >> Not sure what you mean by partitionId will exceed that number when > running > >> some jobs. Can you please elaborate? Do you mean that the > >> partitioner.getPartition(Key, Value, Partition) is getting a value for > >> Partition > num physical outputs? In that case, please check your > >> customized output code because that should be the one calling the > >> getPartition() method. > >> > >> Bikas > >> > >> -----Original Message----- > >> From: Manu Zhang [mailto:owenzhang1...@gmail.com] > >> Sent: Monday, January 20, 2014 6:10 PM > >> To: dev@tez.incubator.apache.org > >> Subject: How to decide number of partitions in a Map Output > >> > >> Hi all, > >> > >> I've been working on a customized Output which works like > >> OnFileSortedOutput but with optimizations that will speed up Map output. > >> > >> The issue is about the *number of partitions*. My current implementation > >> is set it to number of physicalOutputs but the *partitionId will exceed > >> that > >> number* when runnning some jobs. > >> > >> After referring to MRPartitioner, I found the number of partition is > set > >> to "tez.runtime.num.expected.partitions" (or 1 if null) . So what is the > >> difference between that property and physicalOutputs ? > >> > >> Also , when running Hive queries over Tez (with my customized output), a > >> Hive property "hive.exec.reducers.bytes.per.reducer" could also alter > the > >> number of partitions, according to my observation. > >> > >> Any ideas ? > >> Thanks > >> > >> Manu Zhang > >> > >> -- > >> CONFIDENTIALITY NOTICE > >> NOTICE: This message is intended for the use of the individual or > entity to > >> which it is addressed and may contain information that is confidential, > >> privileged and exempt from disclosure under applicable law. If the > reader > >> of this message is not the intended recipient, you are hereby notified > that > >> any printing, copying, dissemination, distribution, disclosure or > >> forwarding of this communication is strictly prohibited. If you have > >> received this communication in error, please contact the sender > immediately > >> and delete it from your system. Thank You. > >> > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You. >