Rohini Palaniswamy created PIG-4344:
---------------------------------------
Summary: Add a testcase with CustomPartitioner that tests ordering
within a reducer
Key: PIG-4344
URL: https://issues.apache.org/jira/browse/PIG-4344
Project: Pig
Issue Type: Bug
Reporter: Rohini Palaniswamy
Some of our users use a CustomPartitioner with join or group by because they
know their data and which keys to partition on. Since MapReduce delivers keys
to each reducer in sorted order, they rely on that to have the data ordered as
well. For example:
partition = group mydata by (hour, sortkey1, sortkey2, sortkey3) using
MyCustomPartitioner PARALLEL 24;
The custom partitioner sends hours 0-23 to partitions 0-23, which ensures that
the output is also sorted across reducers without a separate order by.
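A minimal sketch of the behaviour such a testcase would check. It is a dependency-free simulation, not the Pig or Hadoop API: a real partitioner would extend Hadoop's Partitioner class. The class name HourPartitionSketch, its methods, and the two-field records are all hypothetical. It assumes the partitioner simply maps hour h to partition h, and it models the shuffle as a sort on the full key within each partition.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, dependency-free simulation of "partition by hour, sort within reducer".
final class HourPartitionSketch {

    // Assumed partitioning rule: hour h goes to partition h (0-23 with PARALLEL 24).
    static int partitionFor(int hour, int numPartitions) {
        if (hour < 0 || hour >= numPartitions) {
            throw new IllegalArgumentException(
                "hour " + hour + " has no partition in 0.." + (numPartitions - 1));
        }
        return hour;
    }

    // Each record is {hour, sortkey1}. Returns one list per partition,
    // sorted by the full key, modelling the sorted input a reducer sees.
    static List<List<int[]>> shuffle(int[][] records, int numPartitions) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        // Route every record to the partition chosen by the partitioning rule.
        for (int[] record : records) {
            partitions.get(partitionFor(record[0], numPartitions)).add(record);
        }
        // Sort each partition by (hour, sortkey1).
        Comparator<int[]> byFullKey =
            Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]);
        for (List<int[]> partition : partitions) {
            partition.sort(byFullKey);
        }
        return partitions;
    }

    public static void main(String[] args) {
        int[][] records = { {23, 7}, {0, 9}, {23, 1}, {0, 2}, {5, 4} };
        List<List<int[]>> partitions = shuffle(records, 24);
        for (int p = 0; p < partitions.size(); p++) {
            for (int[] r : partitions.get(p)) {
                System.out.println("partition " + p + ": hour=" + r[0] + " sortkey1=" + r[1]);
            }
        }
    }
}
```

Reading the partitions in index order gives records ordered by hour and then by sort key. That is the property users rely on and the one the testcase should assert.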
With HCatStorer, this pattern will be used more, i.e.,
partition = group mydata by (hour) using MyCustomPartitioner PARALLEL 24;
store partition into 'mydb.mytable' using HCatStorer();
instead of
store mydata into 'mydb.mytable' using HCatStorer();
where hour is the partition. The extra group by above ensures that 1 file is
created per partition, instead of 24 files per partition that would have to be
concatenated later to save namespace.