One of the changes we want to make in the new pig pipeline is to make much more aggressive use of the combiner. In thinking through when we should use the combiner, I came up with the following. The list is not exhaustive, but it includes common expressions and should be possible to implement within a week or so. If you can think of other rules that fit this profile, please suggest them.

1) Filters.
   a) If the predicate does not operate on any of the bags (that is, it
   only operates on the grouping key) then the filter will be relocated
   to the combiner phase.  For example:

   b = group a by $0;
   c = filter b by group != 'fred';
   ...

   In this case subsequent operations to the filter could also be
   considered for pushing into the combiner.

   b) If it operates on the bags with an algebraic function, then a
   foreach with the initial function will be placed in the combiner
   phase and the filter in the reduce phase will be changed to use the
   final function.  For example:

   b = group a by $0;
   c = filter b by count(a) > 0;
   ...

2) Foreach.
   a) If the foreach does not contain a nested plan and all UDFs in the
   generate statement are algebraic, then the foreach will be copied
   and placed in the combiner phase.  The version of the foreach in the
   combiner stage will use the initial function, and the version in the
   reduce stage     will be changed to use the final function.  For
   example:

   b = group a by $0;
   c = foreach b generate group, group + 5, sum(a.$1);

   b) If the foreach has an inner plan that has a distinct and no
   filters, then it will be left as is in the reduce plan and a
   combiner plan will be created that runs the inner plan minus the
   generate on the tuples, thus creating the distinct portion of the
   data without applying the UDF.  For example:

   b = group a by $0;
   c = foreach b {
       c1 = distinct $1;
       generate group, COUNT(c1);
   }

3) Distinct. This will be converted to apply the distinct in the combiner as well as in the reducer.

4) Limit. The limit will be applied in the combiner and again in the reducer.

Alan.

Reply via email to