[ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yan Zhou updated PIG-1198:
--------------------------
Attachment: PIG-1198.patch
Updated patch to address the review comments.
> [zebra] performance improvements
> --------------------------------
>
> Key: PIG-1198
> URL: https://issues.apache.org/jira/browse/PIG-1198
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.6.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Fix For: 0.7.0
>
> Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch
>
>
> Current input split generation is a row-based split on individual TFiles.
> This has the undesirable effect that even TFiles smaller than one block
> still get one split each. Consequently, many mappers, and many waves, are
> needed to handle the many small TFiles generated by as many
> mappers/reducers that wrote the data. This issue can be addressed by
> generating input splits that can include multiple TFiles.
> For sorted tables, key distribution generation by table, which is used to
> generate proper input splits, includes key distributions from column groups
> even if they are not in the projection. This incurs the extra cost of
> unnecessary computation and, worse, produces unreasonable results in input
> split generation;
> For unsorted tables, when row splits are generated on a union of tables, the
> FileSplits are generated for each table and then lumped together to form the
> final list of splits passed to Map/Reduce. This has the undesirable effect
> that the number of splits depends on the number of tables in the table union
> and is not controlled solely by the number of splits used by the Map/Reduce
> framework;
> The input split's goal size is calculated on all column groups even if some
> of them are not in the projection;
> For input splits of multiple files in one column group, all files are opened
> at startup. This is unnecessary and holds resources from start to end. The
> files should be opened when needed and closed when no longer needed.
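
The first item in the description, generating input splits that can include multiple TFiles, can be illustrated with a minimal sketch. This is not the code in PIG-1198.patch and does not use any Zebra or Hadoop API; the function name `pack_files_into_splits`, the file names, and the greedy in-order policy are all assumptions made for illustration. It only shows how grouping small files against a goal size reduces the number of splits compared with one split per file.

```python
from typing import List, Tuple


def pack_files_into_splits(
    file_sizes: List[Tuple[str, int]], goal_size: int
) -> List[List[str]]:
    """Greedily group files, in order, into splits of roughly goal_size bytes.

    Hypothetical helper for illustration only; not part of Zebra or Hadoop.
    """
    if goal_size <= 0:
        raise ValueError("goal_size must be positive")

    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0

    for name, size in file_sizes:
        # Close the current split if adding this file would exceed the goal.
        if current and current_bytes + size > goal_size:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size

    # Flush the last, possibly partially filled, split.
    if current:
        splits.append(current)
    return splits


# Hypothetical file names and sizes (in arbitrary units).
files = [("part-0", 10), ("part-1", 20), ("part-2", 40), ("part-3", 70), ("part-4", 5)]
splits = pack_files_into_splits(files, goal_size=64)
```

With one split per file, the five files above would need five splits; the greedy grouping yields four. In this sketch a file larger than the goal size is kept whole in its own split rather than subdivided, which is a simplification.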