[
https://issues.apache.org/jira/browse/PIG-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olga Natkovich reassigned PIG-350:
----------------------------------
Assignee: Alan Gates (was: Daniel Dai)
> Join optimization for pipeline rework
> -------------------------------------
>
> Key: PIG-350
> URL: https://issues.apache.org/jira/browse/PIG-350
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Alan Gates
> Priority: Critical
> Fix For: types_branch
>
> Attachments: join.patch, join2.patch, join3.patch, join4.patch
>
>
> Currently, joins in pig are done as groupings where each input is grouped on
> the join key. In the reduce phase, records from each input are collected
> into a bag for each key, and then a cross product done on these bags. This
> can be optimized by selecting one (hopefully the largest) input and streaming
> through it rather than placing the results in a bag. This will result in
> better memory usage, less spills to disk due to bag overflow, and better
> performance. Ideally, the system would intelligently select which input to
> stream, based on a histogram of value distributions for the keys. Pig does
> not have that kind of metadata. So for now it is best to always pick the
> same input (first or last) so that the user can select which input to stream.
> Similarly, order by in pig is done in this same way, with the grouping keys
> being the ordering keys, and only one input. In this case pig still
> currently collects all the records for a key into a bag, and then flattens
> the bag. This is a total waste, and in some cases causes significant
> performance degradation. The same optimization listed above can address this
> case, where the last bag (in this case the only bag) is streamed rather than
> collected.
> To do these operations, a new POJoinPackage will be needed. It will replace
> POPackage and the following POForEach in these types of scripts, handling
> pulling the records from hadoop and streaming them into the pig pipeline. A
> visitor will need to be added in the map reduce compilation phase that
> detects this case and combines the POPackage with POForeach into this new
> POJoinPackage.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.