PERFORMANCE: Implement a map-side group operator to speed up processing of 
ordered data 

                 Key: PIG-984
             Project: Pig
          Issue Type: New Feature
            Reporter: Richard Ding

The general group by operation in Pig needs both mappers and reducers (the 
aggregation is done in reducers). This incurs disk writes/reads  between 
mappers and reducers.

However, in the cases where the input data has the following properties

   1. The records with the same key are grouped together (such as the data is 
sorted by the keys).
   2. The records with the same key are in the same mapper input.

the group by operation can be performed in the mappers only and thus remove the 
overhead of disk writes/reads.

Alan proposed adding a hint to the group by clause like this one:

A = load 'input' using SomeLoader(...);
B = group A by $0 using "mapside";
C = foreach B generate ...

The proposed addition of using "mapside" to group will be a mapside group 
operator that collects all records for a given key into a buffer. When it sees 
a key change it will emit the key and bag for records it had buffered. It will 
assume that all keys for a given record are collected together and thus there 
is not need to buffer across keys. 

It is expected that "SomeLoader" will be implemented by data systems such as 
Zebra to ensure the data emitted by the loader satisfies the above properties 
(1) and (2).

It will be the responsibility of the user (or the loader) to guarantee these 
properties (1) & (2) before invoking the mapside hint for the group by clause. 
The Pig runtime can't check for the errors in the input data.

For the group by clauses with mapside hint, Pig Latin will only support group 
by columns (including *), not group by expressions nor group all. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

Reply via email to