[zebra] performance improvements
--------------------------------
Key: PIG-1198
URL: https://issues.apache.org/jira/browse/PIG-1198
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.7.0
Currently, input splits are generated as row-based splits on individual
TFiles. As a result, at least one split is generated for every TFile, even for
TFiles smaller than one block. Consequently, many mappers, and many waves, are
needed to handle the many small TFiles produced by the equally many
mappers/reducers that wrote the data. This issue can be addressed by generating
input splits that can include multiple TFiles.
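As a rough illustration of the idea, the sketch below greedily packs small
files into splits until a goal size is reached. It is not the Zebra
implementation; the class name SmallFilePacker, the method pack, and the use of
plain byte lengths in place of TFile handles are assumptions made for this
example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Zebra: packs files into multi-file splits.
final class SmallFilePacker {
    /** Groups file lengths (in bytes) into splits of at most roughly goalSize bytes. */
    static List<List<Long>> pack(List<Long> fileLengths, long goalSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long length : fileLengths) {
            // Close the current split once adding this file would exceed the goal.
            if (!current.isEmpty() && currentBytes + length > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(length);
            currentBytes += length;
        }
        // Emit the last, possibly partially filled, split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

A file larger than the goal size still ends up in a split of its own here;
splitting within a large file is outside the scope of this sketch.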
For sorted tables, the key distribution generated per table, which is used to
generate proper input splits, includes key distributions from column groups
even if they are not in the projection. This incurs the extra cost of
unnecessary computation and, worse, produces unreasonable input splits.
For unsorted tables, when row splits are generated on a union of tables,
FileSplits are generated for each table and then lumped together to form the
final list of splits passed to Map/Reduce. The undesirable consequence is that
the number of splits depends on the number of tables in the union and is not
controlled solely by the number of splits requested by the Map/Reduce
framework.
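One possible way to decouple the split count from the table count is to pool
the files of all tables first and derive a single goal size from the requested
number of splits. The sketch below only illustrates that arithmetic; the class
UnionSplitPlanner and its methods are hypothetical and file lengths stand in
for real files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Zebra: plans splits over a union of tables.
final class UnionSplitPlanner {
    /** Flattens per-table file lengths into one pool, ignoring table boundaries. */
    static List<Long> pool(List<List<Long>> filesPerTable) {
        List<Long> pooled = new ArrayList<>();
        for (List<Long> tableFiles : filesPerTable) {
            pooled.addAll(tableFiles);
        }
        return pooled;
    }

    /** Bytes per split when the whole union should be covered by requestedSplits splits. */
    static long goalSize(List<Long> pooledLengths, int requestedSplits) {
        long total = 0;
        for (long length : pooledLengths) {
            total += length;
        }
        // Ceiling division, with a floor of one byte for empty input.
        return Math.max(1, (total + requestedSplits - 1) / requestedSplits);
    }
}
```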
The input split's goal size is calculated over all column groups, even those
that are not in the projection.
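A minimal sketch of restricting the size calculation to projected column
groups follows. The class ProjectedGoalSize, the map from column group name to
byte count, and the set of projected names are assumptions for illustration and
do not reflect Zebra's actual data structures.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Zebra: sizes only what the projection reads.
final class ProjectedGoalSize {
    /** Sums the bytes of the column groups that appear in the projection. */
    static long projectedBytes(Map<String, Long> bytesPerColumnGroup,
                               Set<String> projectedGroups) {
        long total = 0;
        for (Map.Entry<String, Long> entry : bytesPerColumnGroup.entrySet()) {
            // Column groups outside the projection are never read, so skip them.
            if (projectedGroups.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

Dividing this total by the desired number of splits would then give a goal
size that reflects only the data a job actually reads.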
For input splits that span multiple files in one column group, all files are
opened at startup. This is unnecessary and holds resources from start to end.
Files should be opened only when needed and closed as soon as they are no
longer needed.
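The open-on-demand behaviour could look roughly like the sketch below, which
keeps at most one reader open at a time. The class LazyFileSequence and the
opener function are hypothetical; a real implementation would open TFile
readers rather than generic Closeable objects.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Zebra: opens the files of a split one by one.
final class LazyFileSequence<R extends Closeable> {
    private final List<String> paths;
    private final Function<String, R> opener; // assumed: turns a path into an open reader
    private int next = 0;
    private R current;

    LazyFileSequence(List<String> paths, Function<String, R> opener) {
        // Nothing is opened here; files are opened only in advance().
        this.paths = paths;
        this.opener = opener;
    }

    /** Closes the reader in use, then opens the next file; returns null when none remain. */
    R advance() {
        close();
        if (next >= paths.size()) {
            return null;
        }
        current = opener.apply(paths.get(next++));
        return current;
    }

    /** Releases the currently open reader, if any. */
    void close() {
        if (current != null) {
            try {
                current.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                current = null;
            }
        }
    }
}
```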
--
This message is automatically generated by JIRA.