It would help us understand your situation if you could give a short
description of how you changes are going to speed up Map output. If it is
generally useful then we could consider adding it the existing library of
inputs and outputs.

The number of physical outputs == the number of downstream consumers of
that partitioned data. Think of them as the number of reducers. So if
there are N reducers (and thus you will be partitioning the data N ways)
then number of physical output == N.

The MRPartitioner config item (tez.runtime.num.expected.partitions) is
used to communicate the above information to the MRPartitioner. In the
above example you would set it to N so that the MRPartitioner would
partition the data N ways.

Hive uses internal statistics to calculate the expected number of
partitions at compile time. However once Hive determines the number of
tasks (say reducers) then the partitioner will always get that value for
the number of partitions to create.

Not sure what you mean by partitionId will exceed that number when running
some jobs. Can you please elaborate? Do you mean that the
partitioner.getPartition(Key, Value, Partition) is getting a value for
Partition > num physical outputs? In that case, please check your
customized output code because that should be the one calling the
getPartition() method.

Bikas

-----Original Message-----
From: Manu Zhang [mailto:owenzhang1...@gmail.com]
Sent: Monday, January 20, 2014 6:10 PM
To: dev@tez.incubator.apache.org
Subject: How to decide number of partitions in a Map Output

Hi all,

I've been working on a customized Output which works like
OnFileSortedOutput but with optimizations that will speed up Map output.

The issue is about the *number of partitions*. My current implementation
is set it to number of physicalOutputs but the *partitionId will exceed
that
number* when runnning some jobs.

After referring to  MRPartitioner, I found the number of partition is set
to "tez.runtime.num.expected.partitions" (or 1 if null) . So what is the
difference between that property and physicalOutputs ?

Also , when running Hive queries over Tez (with my customized output), a
Hive property "hive.exec.reducers.bytes.per.reducer" could also alter the
number of partitions, according to my observation.

Any ideas ?
Thanks

Manu Zhang

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Reply via email to