In the current version, the combiner is not used with cogroup. With the
pipeline rework going on in the types branch, the combiner will be used
for cogroups like:
C = cogroup A, B;
D = foreach C generate project, algebraic, algebraic, ...
where project is a non-UDF expression that projects fields from C and
algebraic represents an algebraic UDF on one of the fields of C.
Projections that are flattened will not be combined, because all the
records are necessary to properly materialize the cross product. That
means the optimization proposed in PIG-350 won't interact with the
combiner.
As for cross and the combiner, we don't yet have a combiner algorithm
for optimizing cross. This is doable but complicated. Are you
currently using cross? We had not focused on this as an optimization
area because we were not aware of people who used it.
You mention using the combiner with filters. Were you wanting us to
catch cases like:
B = group A;
C = filter B by $0 > 5;
D = foreach C generate group, COUNT(A);
and push both the filter and the foreach into the combiner? That is
possible, but we have put it off in favor of pushing the filter above
the group instead. (We don't do this filter pushing yet, but work is
ongoing to develop an optimizer that will handle these kinds of
optimizations.) The only case we could think of where you wouldn't want
to push the filter (and we won't) is when the filter involves a UDF
that might be very expensive to call, so you want to wait until after
the data is grouped to minimize the number of calls to the UDF.
Alan.
Mridul Muralidharan wrote:
This would be absolutely great!
Btw, hope this continues to work fine with combiners in the case of
COGROUP + FILTER (combiners are applicable in this case, right? Or
only for group?).
Additionally, what would the impact of this be on CROSS + FILTER? (I
am assuming that CROSS + FILTER is not combinable currently.)
Thanks,
Mridul
Alan Gates (JIRA) wrote:
Join optimization for pipeline rework
-------------------------------------
Key: PIG-350
URL: https://issues.apache.org/jira/browse/PIG-350
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Daniel Dai
Priority: Critical
Fix For: types_branch
Currently, joins in pig are done as groupings where each input is
grouped on the join key. In the reduce phase, records from each
input are collected into a bag for each key, and then a cross product
done on these bags. This can be optimized by selecting one
(hopefully the largest) input and streaming through it rather than
placing the results in a bag. This will result in better memory
usage, fewer spills to disk due to bag overflow, and better
performance. Ideally, the system would intelligently select which
input to stream, based on a histogram of value distributions for the
keys. Pig does not have that kind of metadata. So for now it is
best to always pick the same input (first or last) so that the user
can select which input to stream.
Similarly, order by in pig is done in this same way, with the
grouping keys being the ordering keys, and only one input. In this
case pig still currently collects all the records for a key into a
bag, and then flattens the bag. This is a total waste, and in some
cases causes significant performance degradation. The same
optimization listed above can address this case, where the last bag
(in this case the only bag) is streamed rather than collected.
To do these operations, a new POJoinPackage will be needed. It will
replace POPackage and the following POForEach in these types of
scripts, handling pulling the records from hadoop and streaming them
into the pig pipeline. A visitor will need to be added in the map
reduce compilation phase that detects this case and combines the
POPackage and POForEach into this new POJoinPackage.