pi song wrote:
1) I think this is something for the query optimizer to do (filter pushing).
In fact, the operation:

b = group a by $0;
c = filter b by group != 'fred';

           becomes

before_b = filter a by $0 != 'fred';
b = group before_b by $0;
c = b;

so it can be done in the map stage, before the combiner.

I agree that the query optimizer should push filters when possible. But in some cases it may choose not to: if the user turns the optimization off, or if the filter involves a UDF (which could be expensive, so we may not want to evaluate it before the grouping), then the filter will remain after the grouping. In that case we can still push it into the combiner. I agree this is a lower priority change than the others.
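
For example (IsExpensiveCheck is a hypothetical filter UDF, shown only for illustration; because the predicate touches nothing but the grouping key, the filter remains combiner-eligible even when the optimizer declines to push it above the group):

b = group a by $0;
c = filter b by IsExpensiveCheck(group); -- stays after the group, but can run in the combiner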

2a) According to the new combiner behavior discussed in PIG-274, it seems we
will have to
   - call the initial function at the end of the map stage.
   - call the intermediate function as the combiner (because it is optional).
   - call the final function in the reduce.

Eventually, yes. But we have gotten the Hadoop folks to agree that, for at least version 0.18, they will support a backward compatibility option where the combiner is run exactly once instead of 0-n times. I'm not writing this combiner logic to handle the 0-n case yet, as that's a big change that I don't want to fold into the already overlong pipeline rework.
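
As a rough sketch of how an algebraic aggregate would be staged under the new behavior (Initial/Intermed/Final are the sub-functions Pig's Algebraic UDF interface supplies; the stage comments describe intended placement, not script syntax):

b = group a by $0;
c = foreach b generate group, SUM(a.$1);
-- map:      SUM's Initial runs as the map output is produced
-- combiner: SUM's Intermed merges partial sums, 0-n times
-- reduce:   SUM's Final turns the partials into the result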

Alan.
Pi

On Wed, Jul 16, 2008 at 10:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote:

One of the changes we want to make in the new Pig pipeline is to make much
more aggressive use of the combiner.  In thinking through when we should use
the combiner, I came up with the following list.  It is not exhaustive, but
it covers common expressions and should be implementable within a week or
so.  If you can think of other rules that fit this profile, please suggest
them.

1) Filters.
  a) If the predicate does not operate on any of the bags (that is, it
  only operates on the grouping key) then the filter will be relocated
  to the combiner phase.  For example:

  b = group a by $0;
  c = filter b by group != 'fred';
  ...

  In this case, operations subsequent to the filter could also be
  considered for pushing into the combiner.
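
  For instance (a sketch: by rule 2a below, an algebraic foreach that
  follows the pushed filter could move into the combiner as well):

  b = group a by $0;
  c = filter b by group != 'fred';
  d = foreach c generate group, SUM(a.$1);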

  b) If it operates on the bags with an algebraic function, then a
  foreach with the initial function will be placed in the combiner
  phase and the filter in the reduce phase will be changed to use the
  final function.  For example:

  b = group a by $0;
  c = filter b by COUNT(a) > 0;
  ...
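
  Conceptually, the rewritten plan would look something like this (a
  sketch only: b1 is an invented name, and COUNT$Initial/COUNT$Final
  denote the sub-functions supplied through Pig's Algebraic interface,
  not functions one would call directly in a script):

  -- combiner: b1 = foreach b generate group, COUNT$Initial(a);
  -- reduce:   c  = filter b1 by COUNT$Final($1) > 0;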

2) Foreach.
  a) If the foreach does not contain a nested plan and all UDFs in the
  generate statement are algebraic, then the foreach will be copied
  and placed in the combiner phase.  The version of the foreach in the
  combiner stage will use the initial function, and the version in the
  reduce stage will be changed to use the final function.  For
  example:

  b = group a by $0;
  c = foreach b generate group, group + 5, SUM(a.$1);
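
  Roughly (a sketch with an invented intermediate name c1, again using
  the Algebraic sub-function names for illustration; the re-grouping
  Hadoop performs between combiner and reduce is elided):

  -- combiner: c1 = foreach b generate group, SUM$Initial(a.$1);
  -- reduce:   c  = foreach c1 generate group, group + 5, SUM$Final($1);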

  b) If the foreach has an inner plan that contains a distinct and no
  filters, then it will be left as is in the reduce plan, and a
  combiner plan will be created that runs the inner plan (minus the
  generate) on the tuples, thus producing the distinct portion of the
  data without applying the UDF.  For example:

  b = group a by $0;
  c = foreach b {
      c1 = distinct $1;
      generate group, COUNT(c1);
  }
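
  In sketch form (b1 is an invented name; the combiner keeps the
  distinct but drops the generate's UDF, so the data is deduplicated
  before COUNT ever runs):

  -- combiner: b1 = foreach b { c1 = distinct $1; generate group, c1; }
  -- reduce:   c runs as written, re-applying the distinct and COUNT
  --           over the combined partial results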

3) Distinct.  This will be converted to apply the distinct in the combiner
as well as in the reducer.
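
For example (the comments sketch where the work runs):

  b = distinct a;
  -- combiner: drops duplicates within each map's partial output
  -- reducer:  drops duplicates that span maps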

4) Limit.  The limit will be applied in the combiner and again in the
reducer.
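
For example (again, the comments sketch the placement):

  b = limit a 100;
  -- combiner: emits at most 100 records per invocation
  -- reducer:  applies the limit again to produce the final 100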

Alan.

