Hi Alan,
Please see inline.
Regards,
Mridul
Alan Gates wrote:
In the current version, the combiner is not used with cogroup. With the
pipeline rework going on in the types branch, the combiner will be used
for cogroups like:
C = cogroup A, B;
D = foreach C generate project, algebraic, algebraic, ...
where project is a non-UDF expression that projects fields from C and
algebraic represents an algebraic UDF on one of the fields of C.
Projections that are flattened will not be combined, because all the
records are necessary to properly materialize the cross product. So
that means the optimization proposed in PIG-350 won't interact with the
combiner.
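For concreteness, a combinable script of that shape (hypothetical relations and fields) might look like:

A = load 'a' as (id:int, x:int);
B = load 'b' as (id:int, y:int);
C = cogroup A by id, B by id;
-- 'group' is the projected non-UDF expression; SUM and COUNT are algebraic
D = foreach C generate group, SUM(A.x), COUNT(B);

Since neither bag is flattened, the combiner can apply.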
This is great to hear; I should have followed this work a bit more carefully.
I will need to consider your comment above in more detail later on. Thanks!
As far as cross and the combiner, we don't yet have a combiner algorithm
for optimizing cross. This is doable but complicated. Are you
currently using cross? We had not focused on this as an optimization
area because we were not aware of people who used it.
One primary use case that comes to mind for cross is the following:
Despite the obvious cost, we have to use CROSS to apply join
constraints across 'tables' (sql-like psox joins, not normal pig
table joins). So we currently end up with something like this:
entity1 = ... -- apply entity1 constraints from relevant table(s)
entity2 = ... -- apply entity2 constraints from relevant table(s)
entity3 = ... -- apply entity3 constraints from relevant table(s)
-- for inter-entity constraints:
cross_res = CROSS entity1, entity2, entity3, ...;
constrained_op = FOREACH cross_res {
    -- apply inter-entity constraints, computing a 'valid' flag
    GENERATE valid, entity1, entity2, entity3;
};
constrained_op = FILTER constrained_op BY valid == '1';
This could be split into a series of cogroups and filters (imo), but I
have not researched that in much depth (purely in the interest of time);
a sketch of the idea follows below.
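A sketch of that rewrite, for the case where the inter-entity constraint
is an equality on some key (hypothetical field names):

g = COGROUP entity1 BY k, entity2 BY k;
-- keep only keys present in both inputs, then materialize the join
both = FILTER g BY NOT IsEmpty(entity1) AND NOT IsEmpty(entity2);
joined = FOREACH both GENERATE FLATTEN(entity1), FLATTEN(entity2);

This only covers equi-constraints; arbitrary inter-entity predicates
would still need the cross.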
But as should be obvious here, the cost of cross is extremely high (even
though the entity tables are already constrained, the cross can 'blow
up'), and a very high percentage of the output is going to be discarded.
Which is why, if there is any form of combiner that is applicable here,
or any other optimization that is possible, it would be tremendously
beneficial for us; indeed, any form of optimization on cross would be,
for that matter.
You mention using the combiner with filters. Were you wanting us to
catch cases like:
B = group A by $0;
C = filter B by $0 > 5;
D = foreach C generate group, COUNT(A);
and push both the filter and the foreach into the combiner? That is
possible, but we have put that off in favor of pushing the filter
above the group instead. (We don't do this filter pushing yet, but work
is ongoing to develop an optimizer that will do these kinds of
optimizations.) The only case we could think of where you wouldn't want
to push the filter (and we won't) is when the filter involves a UDF
which might be very expensive to call, so you want to wait until after
the data is grouped to minimize the number of calls to the UDF.
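For concreteness, the pushed-filter form of that example, assuming the
first field of A is the grouping key, would be:

A1 = filter A by $0 > 5;
B1 = group A1 by $0;
D1 = foreach B1 generate group, COUNT(A1);

The filter then runs map-side before any grouping, and the foreach
remains combinable.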
My current assumption was that a group followed by a foreach is combinable.
So my thought is to use the foreach as a filter too: generate a validity
flag which can be filtered on later, and generate data only if it is
valid (otherwise the output will be empty, thereby saving on data size).
But if the filter, or the filter + foreach, or combinations thereof
could be pushed into the combiner, that would be absolutely great.
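A minimal sketch of that validity-flag pattern, with hypothetical names
(whether this exact shape remains combinable is of course the question):

B = GROUP A BY key;
-- bincond computes the flag from an algebraic UDF
C = FOREACH B GENERATE group, ((COUNT(A) > 0) ? 1 : 0) AS valid, COUNT(A) AS cnt;
D = FILTER C BY valid == 1;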
I have not worried too much about it since we have not yet got to the
stage where these things are a concern (I am still in the early
prototype phase, when I get time).
Thanks,
Mridul
Alan.
Mridul Muralidharan wrote:
This would be absolutely great!
Btw, I hope this continues to work fine with combiners in the case of
COGROUP + FILTER (combiners are applicable in this case, right? or
only for group?).
Additionally, what would the impact of this be on CROSS + FILTER? (I
am assuming that CROSS + FILTER is not combinable currently.)
Thanks,
Mridul
Alan Gates (JIRA) wrote:
Join optimization for pipeline rework
-------------------------------------
Key: PIG-350
URL: https://issues.apache.org/jira/browse/PIG-350
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Daniel Dai
Priority: Critical
Fix For: types_branch
Currently, joins in pig are done as groupings where each input is
grouped on the join key. In the reduce phase, records from each
input are collected into a bag for each key, and then a cross product
is done on these bags. This can be optimized by selecting one input
(hopefully the largest) and streaming through it rather than
placing its records in a bag. This will result in better memory
usage, fewer spills to disk due to bag overflow, and better
performance. Ideally, the system would intelligently select which
input to stream, based on a histogram of value distributions for the
keys. Pig does not have that kind of metadata, so for now it is
best to always pick the same input (first or last) so that the user
can select which input to stream.
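For illustration (hypothetical inputs), the join shape this targets is:

big = load 'big_input' as (k:int, v:chararray);
small = load 'small_input' as (k:int, w:chararray);
-- one input's records would be streamed rather than bagged; fixing the
-- choice (first or last) lets the user put the large input in that slot
J = join big by k, small by k;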
Similarly, order by in pig is done in this same way, with the
grouping keys being the ordering keys, and only one input. In this
case pig still currently collects all the records for a key into a
bag, and then flattens the bag. This is a total waste, and in some
cases causes significant performance degradation. The same
optimization listed above can address this case, where the last bag
(in this case the only bag) is streamed rather than collected.
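The order by shape is simply:

S = order A by $0;

where, per the above, pig currently collects each key's records into a
bag and then flattens it; streaming the single bag removes that step.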
To do these operations, a new POJoinPackage will be needed. It will
replace POPackage and the following POForEach in these types of
scripts, handling pulling the records from hadoop and streaming them
into the pig pipeline. A visitor will need to be added in the map
reduce compilation phase that detects this case and combines the
POPackage with the POForEach into this new POJoinPackage.