Thanks, this is extremely helpful.

On Sat, Jan 25, 2014 at 1:45 AM, Bikas Saha <bi...@hortonworks.com> wrote:

> Set tez.am.shuffle-vertex-manager.enable.auto-parallel=true for tez
> auto-reduce.
>
> >> About the Hive compiler feature, could you elaborate a bit more? Do
> you mean the *SetReducerParallelism* class? In one case, the number of
> reducers of Hive-over-MR is exactly *(number of bytes read from HDFS /
> hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how
> SetReducerParallelism does it.
>
> + Gunther for the hive question. In case Gunther does not respond please
> email the hive user/dev lists with your question.
>
> Bikas
>
> -----Original Message-----
> From: Manu Zhang [mailto:owenzhang1...@gmail.com]
> Sent: Friday, January 24, 2014 2:22 AM
> To: dev@tez.incubator.apache.org
> Subject: Re: How to decide number of partitions in a Map Output
>
> Hi Bikas,
>
> Could you please specify the property to turn on that feature in Tez?
> About the Hive compiler feature, could you elaborate a bit more? Do you
> mean the *SetReducerParallelism* class? In one case, the number of
> reducers of Hive-over-MR is exactly *(number of bytes read from HDFS /
> hive.exec.reducers.bytes.per.reducer)* by my calculation, which is how
> SetReducerParallelism does it.
>
> Thanks,
> Manu
>
>
> On Fri, Jan 24, 2014 at 1:09 PM, Bikas Saha <bi...@hortonworks.com> wrote:
>
> > This feature is turned off by default in Tez. So unless you have
> > turned it on (or Hive turns it on), the number of reducers is probably
> > different because the latest Hive compiler tries to determine the number
> > of reducers via table statistics at compile time. It may be that this
> > Hive compiler feature is on by default for Hive-on-Tez compilation but
> > off by default for Hive-on-MR compilation.
> >
> > Bikas
> >
> > -----Original Message-----
> > From: Hitesh Shah [mailto:hit...@apache.org]
> > Sent: Thursday, January 23, 2014 8:27 PM
> > To: dev@tez.incubator.apache.org
> > Subject: Re: How to decide number of partitions in a Map Output
> >
> > Yes. It looks at the size of outputs generated by the tasks in the
> > previous vertex and determines the no. of tasks to run based on the
> > configured amount of input data size per task.
> >
> > -- Hitesh
> >
> > On Jan 23, 2014, at 7:59 PM, Manu Zhang wrote:
> >
> > > > So that's what *ShuffleVertexManager.determineParallelismAndApply*
> > > > does?
> > >
> > >
> > > > On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan
> > > > <gop...@apache.org> wrote:
> > >
> > > >> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang
> > > >> <owenzhang1...@gmail.com> wrote:
> > >>
> > > >>> Another question is that, when running Hive over Tez, why is the
> > > >>> number of reducers not the same as that of Hive over MR, provided
> > > >>> with the same input data and configurations?
> > >>
> > >> This is too big a note to write out into an email
> > >>
> > > >> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
> > >>
> > > >> But in short, unlike MR, Tez enables you to set reducer count
> > > >> dynamically.
> > >>
> > >> Cheers,
> > >> Gopal
> > >>
> > > >>> On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha
> > > >>> <bi...@hortonworks.com> wrote:
> > >>>
> > > >>>> It would help us understand your situation if you could give a
> > > >>>> short description of how your changes are going to speed up Map
> > > >>>> output. If it is generally useful then we could consider adding
> > > >>>> it to the existing library of inputs and outputs.
> > >>>>
> > > >>>> The number of physical outputs == the number of downstream
> > > >>>> consumers of that partitioned data. Think of them as the number
> > > >>>> of reducers. So if there are N reducers (and thus you will be
> > > >>>> partitioning the data N ways) then the number of physical
> > > >>>> outputs == N.
> > >>>>
> > >>>> The MRPartitioner config item
> > >>>> (tez.runtime.num.expected.partitions)
> > >>>> is used to communicate the above information to the MRPartitioner.
> > >>>> In the above example you would set it to N so that the
> > >>>> MRPartitioner would partition the data N ways.
> > >>>>
> > >>>> Hive uses internal statistics to calculate the expected number of
> > >>>> partitions at compile time. However once Hive determines the
> > >>>> number of tasks (say reducers) then the partitioner will always
> > >>>> get that value for the number of partitions to create.
> > >>>>
> > > >>>> Not sure what you mean by partitionId will exceed that number
> > > >>>> when running some jobs. Can you please elaborate? Do you mean that
> > > >>>> the partitioner.getPartition(Key, Value, Partition) is getting a
> > > >>>> value for Partition > num physical outputs? In that case, please
> > > >>>> check your customized output code because that should be the one
> > > >>>> calling the getPartition() method.
> > >>>>
> > >>>> Bikas
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Manu Zhang [mailto:owenzhang1...@gmail.com]
> > >>>> Sent: Monday, January 20, 2014 6:10 PM
> > >>>> To: dev@tez.incubator.apache.org
> > >>>> Subject: How to decide number of partitions in a Map Output
> > >>>>
> > >>>> Hi all,
> > >>>>
> > > >>>> I've been working on a customized Output which works like
> > > >>>> OnFileSortedOutput but with optimizations that will speed up Map
> > > >>>> output.
> > >>>>
> > > >>>> The issue is about the *number of partitions*. My current
> > > >>>> implementation sets it to the number of physicalOutputs, but the
> > > >>>> *partitionId will exceed that number* when running some jobs.
> > >>>>
> > > >>>> After referring to MRPartitioner, I found the number of
> > > >>>> partitions is set to "tez.runtime.num.expected.partitions" (or 1
> > > >>>> if null). So what is the difference between that property and
> > > >>>> physicalOutputs?
> > >>>>
> > > >>>> Also, when running Hive queries over Tez (with my customized
> > > >>>> output), a Hive property "hive.exec.reducers.bytes.per.reducer"
> > > >>>> could also alter the number of partitions, according to my
> > > >>>> observation.
> > >>>>
> > >>>> Any ideas ?
> > >>>> Thanks
> > >>>>
> > >>>> Manu Zhang
> > >>>>
> > >>>> --
> > >>>> CONFIDENTIALITY NOTICE
> > >>>> NOTICE: This message is intended for the use of the individual or
> > >> entity to
> > >>>> which it is addressed and may contain information that is
> > >>>> confidential, privileged and exempt from disclosure under
> > >>>> applicable law. If the
> > >> reader
> > >>>> of this message is not the intended recipient, you are hereby
> > >>>> notified
> > >> that
> > >>>> any printing, copying, dissemination, distribution, disclosure or
> > >>>> forwarding of this communication is strictly prohibited. If you
> > >>>> have received this communication in error, please contact the
> > >>>> sender
> > >> immediately
> > >>>> and delete it from your system. Thank You.
> > >>>>
> > >>
> > >> --
> > >> CONFIDENTIALITY NOTICE
> > >> NOTICE: This message is intended for the use of the individual or
> > >> entity to which it is addressed and may contain information that is
> > >> confidential, privileged and exempt from disclosure under
> > >> applicable law. If the reader of this message is not the intended
> > >> recipient, you are hereby notified that any printing, copying,
> > >> dissemination, distribution, disclosure or forwarding of this
> > >> communication is strictly prohibited. If you have received this
> > >> communication in error, please contact the sender immediately and
> > >> delete it from your
> > system. Thank You.
> > >>
> >
>

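For reference, two points from the quoted thread can be sketched in a few lines: the reducer estimate (bytes read / hive.exec.reducers.bytes.per.reducer) and the rule that a partitioner asked for N partitions should only return ids below N. This is a minimal sketch in Python, not the Hive or Tez code. The function names are hypothetical, rounding up and the floor of one reducer are assumptions, and a CRC32 hash stands in for whatever partitioner is actually configured.

```python
import zlib


def estimate_reducers(input_bytes: int, bytes_per_reducer: int) -> int:
    """Estimate reducers as input size divided by bytes per reducer.

    Assumption: round up and never return fewer than one reducer.
    """
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    # Ceiling division without floating point.
    return max(1, -(-input_bytes // bytes_per_reducer))


def get_partition(key: bytes, num_partitions: int) -> int:
    """Hypothetical hash partitioner: always returns an id in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # zlib.crc32 is non-negative in Python 3, so the modulo stays in range.
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    # Example values only: 10 GB of input, 1 GB per reducer.
    n = estimate_reducers(10 * 10**9, 10**9)
    ids = {get_partition(str(i).encode(), n) for i in range(1000)}
    print(n, min(ids), max(ids))
```

If an output is sized for N physical outputs but the partitioner is handed a different count, ids at or above N become possible, which would match the symptom described in the original question.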