[zebra] performance improvements
--------------------------------

                 Key: PIG-1198
                 URL: https://issues.apache.org/jira/browse/PIG-1198
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.6.0
            Reporter: Yan Zhou
            Assignee: Yan Zhou
             Fix For: 0.7.0


Input splits are currently generated by row-based splitting of individual
TFiles. As a result, even a TFile smaller than one block still gets a split of
its own. Consequently, many mappers, and many waves of them, are needed to
handle the many small TFiles produced by the equally many mappers/reducers that
wrote the data. This can be addressed by generating input splits that span
multiple TFiles.
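
As a rough illustration of the idea (not the actual Zebra implementation), small
files could be grouped greedily until a goal size is reached. The class
SplitPacker, the type FileInfo and the method pack below are hypothetical names
introduced only for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the Zebra code base.
class SplitPacker {
    /** Stand-in for a TFile: only a name and a length in bytes. */
    static final class FileInfo {
        final String name;
        final long length;

        FileInfo(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Greedily groups files, in the given order, into splits of at most
     * goalSize bytes. A file larger than goalSize ends up alone in its own
     * split; subdividing large files is out of scope for this sketch.
     */
    static List<List<FileInfo>> pack(List<FileInfo> files, long goalSize) {
        List<List<FileInfo>> splits = new ArrayList<>();
        List<FileInfo> current = new ArrayList<>();
        long currentSize = 0;
        for (FileInfo f : files) {
            // Close the current split if adding this file would exceed the goal.
            if (!current.isEmpty() && currentSize + f.length > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(f);
            currentSize += f.length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With a goal size of 64 and files of 10, 20, 30, 50 and 100 bytes, this grouping
yields three splits instead of five: the first three files share one split.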

For sorted tables, the per-table key distribution, which is used to generate
proper input splits, includes key distributions from column groups even if
they are not in the projection. This incurs the extra cost of unnecessary
computation and, worse, produces unreasonable input splits.

For unsorted tables, when row splits are generated on a union of tables,
FileSplits are generated for each table and then lumped together to form the
final list of splits passed to Map/Reduce. The undesirable consequence is that
the number of splits depends on the number of tables in the union rather than
being controlled solely by the number of splits requested by the Map/Reduce
framework.

The input split's goal size is calculated over all column groups, even those
that are not in the projection.
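
A minimal sketch of the intended calculation, assuming the goal size is simply
the total size of the projected column groups divided by the requested number
of splits. That formula, the class GoalSize and its parameters are assumptions
made for illustration and are not taken from the Zebra source:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the names and the formula are illustrative assumptions.
class GoalSize {
    /**
     * Computes a per-split goal size from the byte sizes of the column groups,
     * counting only the groups that appear in the projection.
     */
    static long goalSize(Map<String, Long> columnGroupBytes,
                         Set<String> projected,
                         int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        long total = 0;
        for (Map.Entry<String, Long> e : columnGroupBytes.entrySet()) {
            // Skip column groups that the projection does not read.
            if (projected.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        // Never return 0, so callers can safely divide by the goal size.
        return Math.max(1L, total / numSplits);
    }
}
```

For example, with column groups of 1000 and 3000 bytes and 4 requested splits,
projecting only the first group gives a goal size of 250 bytes, whereas counting
both groups would give 1000.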

For input splits that cover multiple files in one column group, all files are
opened at startup. This is unnecessary and holds resources from start to end.
Files should be opened only when needed and closed as soon as they are no
longer needed.
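
The open-on-demand behaviour could look roughly like the following sketch,
which uses plain java.io readers in place of TFile readers. The class
LazyMultiFileReader and its Opener interface are hypothetical and exist only to
show the pattern of keeping at most one file open at a time:

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: line-oriented readers stand in for TFile readers.
class LazyMultiFileReader implements Closeable {
    /** Factory that opens one underlying file only when asked to. */
    interface Opener {
        Reader open() throws IOException;
    }

    private final Iterator<Opener> openers;
    private BufferedReader current; // at most one file is open at any time

    LazyMultiFileReader(List<Opener> files) {
        // Nothing is opened here; files are opened on first use.
        this.openers = files.iterator();
    }

    /** Returns the next line across all files, or null once all are exhausted. */
    String readLine() throws IOException {
        while (true) {
            if (current == null) {
                if (!openers.hasNext()) {
                    return null;
                }
                current = new BufferedReader(openers.next().open());
            }
            String line = current.readLine();
            if (line != null) {
                return line;
            }
            // This file is exhausted: release it before opening the next one.
            current.close();
            current = null;
        }
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

Constructing the reader opens nothing; the second file is opened only after the
first has been read to the end and closed.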


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
