Re: How to decide number of partitions in a Map Output

Manu Zhang Fri, 24 Jan 2014 02:23:58 -0800

Hi Bikas,

Could you please specify the property to turn on that feature in Tez ?
About the Hive compiler feature, could you elaborate a bit more ? Do you
mean the *SetReducerParallelism *class ? In one case,  the number of
reducers of Hive-over-MR is exactly *(number of bytes read from HDFS
/ hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how
SetPeducerParallelism does it.


Thanks,
Manu


On Fri, Jan 24, 2014 at 1:09 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> This feature is turned off by default in Tez. So unless you have turned it
> on (or hive turns it on), the number of reducers is probably different
> because latest Hive compiler tries to determine the number of reducers via
> table statistics at compile time. It may be that this Hive compiler
> feature is on by default for Hive-on-Tez compilation but off by default
> for Hive-on-MR compilation.
>
> Bikas
>
> -----Original Message-----
> From: Hitesh Shah [mailto:hit...@apache.org]
> Sent: Thursday, January 23, 2014 8:27 PM
> To: dev@tez.incubator.apache.org
> Subject: Re: How to decide number of partitions in a Map Output
>
> Yes. It looks at the size of outputs generated by the tasks in the
> previous vertex and determines the no. of tasks to run based on the
> configured amount of input data size per task.
>
> -- Hitesh
>
> On Jan 23, 2014, at 7:59 PM, Manu Zhang wrote:
>
> > So that's what *ShuffleVertexManager.determineParallelismAndApply* does
> ?
> >
> >
> > On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan
> <gop...@apache.org>wrote:
> >
> >> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang <owenzhang1...@gmail.com>
> >> wrote:
> >>
> >>> Another question is that, when running Hive over Tez, why is the
> >>> number
> >> of
> >>> reducers not the same as that of Hive over MR, provided with the
> >>> same
> >> input
> >>> data and configurations ?
> >>
> >> This is too big a note to write out into an email
> >>
> >> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
> >>
> >> But in short, unlike MR, Tez enables you to set reducer count
> dynamically.
> >>
> >> Cheers,
> >> Gopal
> >>
> >>> On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha <bi...@hortonworks.com>
> >> wrote:
> >>>
> >>>> It would help us understand your situation if you could give a
> >>>> short description of how you changes are going to speed up Map
> >>>> output. If it
> >> is
> >>>> generally useful then we could consider adding it the existing
> >>>> library
> >> of
> >>>> inputs and outputs.
> >>>>
> >>>> The number of physical outputs == the number of downstream
> >>>> consumers of that partitioned data. Think of them as the number of
> >>>> reducers. So if there are N reducers (and thus you will be
> >>>> partitioning the data N ways) then number of physical output == N.
> >>>>
> >>>> The MRPartitioner config item (tez.runtime.num.expected.partitions)
> >>>> is used to communicate the above information to the MRPartitioner.
> >>>> In the above example you would set it to N so that the
> >>>> MRPartitioner would partition the data N ways.
> >>>>
> >>>> Hive uses internal statistics to calculate the expected number of
> >>>> partitions at compile time. However once Hive determines the number
> >>>> of tasks (say reducers) then the partitioner will always get that
> >>>> value for the number of partitions to create.
> >>>>
> >>>> Not sure what you mean by partitionId will exceed that number when
> >> running
> >>>> some jobs. Can you please elaborate? Do you mean that the
> >>>> partitioner.getPartition(Key, Value, Partition) is getting a value
> >>>> for Partition > num physical outputs? In that case, please check
> >>>> your customized output code because that should be the one calling
> >>>> the
> >>>> getPartition() method.
> >>>>
> >>>> Bikas
> >>>>
> >>>> -----Original Message-----
> >>>> From: Manu Zhang [mailto:owenzhang1...@gmail.com]
> >>>> Sent: Monday, January 20, 2014 6:10 PM
> >>>> To: dev@tez.incubator.apache.org
> >>>> Subject: How to decide number of partitions in a Map Output
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I've been working on a customized Output which works like
> >>>> OnFileSortedOutput but with optimizations that will speed up Map
> output.
> >>>>
> >>>> The issue is about the *number of partitions*. My current
> >>>> implementation is set it to number of physicalOutputs but the
> >>>> *partitionId will exceed that
> >>>> number* when runnning some jobs.
> >>>>
> >>>> After referring to  MRPartitioner, I found the number of partition
> >>>> is
> >> set
> >>>> to "tez.runtime.num.expected.partitions" (or 1 if null) . So what
> >>>> is the difference between that property and physicalOutputs ?
> >>>>
> >>>> Also , when running Hive queries over Tez (with my customized
> >>>> output), a Hive property "hive.exec.reducers.bytes.per.reducer"
> >>>> could also alter
> >> the
> >>>> number of partitions, according to my observation.
> >>>>
> >>>> Any ideas ?
> >>>> Thanks
> >>>>
> >>>> Manu Zhang
> >>>>
> >>>> --
> >>>> CONFIDENTIALITY NOTICE
> >>>> NOTICE: This message is intended for the use of the individual or
> >> entity to
> >>>> which it is addressed and may contain information that is
> >>>> confidential, privileged and exempt from disclosure under
> >>>> applicable law. If the
> >> reader
> >>>> of this message is not the intended recipient, you are hereby
> >>>> notified
> >> that
> >>>> any printing, copying, dissemination, distribution, disclosure or
> >>>> forwarding of this communication is strictly prohibited. If you
> >>>> have received this communication in error, please contact the
> >>>> sender
> >> immediately
> >>>> and delete it from your system. Thank You.
> >>>>
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or
> >> entity to which it is addressed and may contain information that is
> >> confidential, privileged and exempt from disclosure under applicable
> >> law. If the reader of this message is not the intended recipient, you
> >> are hereby notified that any printing, copying, dissemination,
> >> distribution, disclosure or forwarding of this communication is
> >> strictly prohibited. If you have received this communication in
> >> error, please contact the sender immediately and delete it from your
> system. Thank You.
> >>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: How to decide number of partitions in a Map Output

Reply via email to