Rohini Palaniswamy created PIG-4344:
---------------------------------------

             Summary: Add a testcase with CustomPartitioner that tests ordering within a reducer
                 Key: PIG-4344
                 URL: https://issues.apache.org/jira/browse/PIG-4344
             Project: Pig
          Issue Type: Bug
            Reporter: Rohini Palaniswamy


Some of our users use a CustomPartitioner with join or group by because they 
know their data and which keys to partition on. Since MapReduce delivers data 
sorted by key within each reducer, they rely on that to keep the data ordered 
as well.

For example:
partitioned = group mydata by (hour, sortkey1, sortkey2, sortkey3) 
PARTITION BY MyCustomPartitioner PARALLEL 24;

The custom partitioner sends hours 0-23 to partitions 0-23, which ensures that 
the data is also sorted without having to do a separate order by.
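
The routing and ordering described above can be illustrated with a small 
simulation. This is a plain Python sketch, not Pig or Hadoop code: 
hour_partition and simulate_reducers are hypothetical names, and the sort 
inside each partition stands in for the key ordering that MapReduce applies 
before handing keys to a reducer. An actual partitioner for Pig would be a 
Java class extending org.apache.hadoop.mapreduce.Partitioner.

```python
from collections import defaultdict

NUM_PARTITIONS = 24  # matches PARALLEL 24 in the statement above


def hour_partition(key, num_partitions=NUM_PARTITIONS):
    """Route a (hour, sortkey1, sortkey2, sortkey3) key by its hour field."""
    hour = key[0]
    return hour % num_partitions


def simulate_reducers(keys, num_partitions=NUM_PARTITIONS):
    """Assign keys to partitions, then sort the keys inside each partition.

    The sort is an assumption standing in for the shuffle-phase ordering
    that MapReduce provides within a reducer.
    """
    partitions = defaultdict(list)
    for key in keys:
        partitions[hour_partition(key, num_partitions)].append(key)
    return {p: sorted(ks) for p, ks in partitions.items()}


# Made-up sample keys, purely for illustration.
keys = [(5, "b", 2, 0), (0, "z", 1, 1), (5, "a", 9, 3), (23, "m", 0, 0), (5, "a", 1, 7)]
reducers = simulate_reducers(keys)

# Reading the reducer outputs in partition order yields globally sorted keys.
flattened = [k for p in sorted(reducers) for k in reducers[p]]
```

Because each hour maps to exactly one partition and partitions are numbered 
in hour order, concatenating the per-reducer outputs gives the same result as 
a full sort of the keys. A test for PIG-4344 would need to assert this 
property on real reducer output.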

With HCatStorer, this pattern will be used more often, i.e., 
partitioned = group mydata by (hour) PARTITION BY MyCustomPartitioner PARALLEL 24;
store partitioned into 'mydb.mytable' using HCatStorer();
    instead of
store mydata into 'mydb.mytable' using HCatStorer();

where hour is the partition. The extra group by ensures that 1 file is created 
per partition, instead of 24 files per partition that would have to be 
concatenated later to save namespace. 
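
The file-count argument can be checked with a little arithmetic. The sketch 
below is a plain Python simulation under stated assumptions: 24 writer tasks, 
every task seeing rows for every hour when no group by is used, and each task 
writing one file per partition value it encounters. The function name is 
hypothetical, and actual HCatStorer behavior depends on the job and its 
configuration.

```python
def files_per_partition(task_hours):
    """Count output files per hour partition.

    task_hours: one set of hours per writer task. Each task is assumed to
    write one file for every hour partition it sees.
    """
    counts = {}
    for hours in task_hours:
        for hour in hours:
            counts[hour] = counts.get(hour, 0) + 1
    return counts


HOURS = range(24)

# Without the group by: assume 24 tasks, each seeing rows for all 24 hours.
without_group = files_per_partition([set(HOURS) for _ in range(24)])

# With the group by and custom partitioner: reducer i only sees hour i.
with_group = files_per_partition([{hour} for hour in HOURS])
```

Under these assumptions the direct store produces 24 files in each of the 24 
partitions (576 in total), while the grouped store produces 1 file per 
partition (24 in total).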



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
