pi song wrote:
1) I think this is a query optimizer thing to do (filter pushing). In fact,
the operation will be:
b = group a by $0;
c = filter b by group != 'fred';
becomes
before_b = filter a by $0 != 'fred';
b = group before_b by $0;
c = b;
so it can be done in the map stage, before the combiner.
I agree that the query optimizer should be pushing filters when
possible. But in some cases it may choose not to: if the user turns it
off, or if the filter involves a UDF (which could potentially be
expensive, and thus one we may not want to evaluate before the grouping),
then the filter will still be after the grouping. In that case we can
still push it up into the combiner. I agree this is a lower-priority
change than the others.
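For example (ExpensiveCheck is a hypothetical UDF, just to illustrate the
case):
b = group a by $0;
c = filter b by ExpensiveCheck(group);
-- ExpensiveCheck may be too costly to evaluate once per input row, so
-- the filter is not pushed in front of the group; but since it looks
-- only at the grouping key, it can still run in the combiner as well
-- as the reducer.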
2a) According to the new combiner behavior discussed in PIG-274, it seems
we will have to:
- call the initial function at the end of the map stage,
- call the intermediate function in the combiner (because the combiner is
optional), and
- call the final function in the reduce stage.
Eventually, yes. But we have gotten the hadoop folks to agree that for
at least version 0.18 they will support a backward-compatibility option
where the combiner will be run exactly once instead of 0-n times. I'm
not writing this combiner logic to handle the 0-n case yet, as that's a
big change that I don't want to include in the already overlong pipeline
rework.
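For concreteness, here is a sketch of how a simple algebraic aggregate
splits across the three phases (my illustration, not actual plan output;
the Initial/Intermed/Final names follow the Algebraic interface
convention, and the real plans are physical operators, not Pig Latin):
a = load 'data';
b = group a by $0;
c = foreach b generate group, COUNT(a);
-- map:      COUNT.Initial computes a partial count over each map's
--           slice of a group's bag
-- combiner: COUNT.Intermed re-aggregates partial counts; under the 0-n
--           behavior it must be safe to run any number of times
-- reduce:   COUNT.Final folds the remaining partials into the final
--           count for the group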
Alan.
Pi
On Wed, Jul 16, 2008 at 10:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
One of the changes we want to make in the new pig pipeline is to make much
more aggressive use of the combiner. In thinking through when we should use
the combiner, I came up with the following. The list is not exhaustive, but
it includes common expressions and should be possible to implement within a
week or so. If you can think of other rules that fit this profile, please
suggest them.
1) Filters.
a) If the predicate does not operate on any of the bags (that is, it
only operates on the grouping key), then the filter will be relocated
to the combiner phase. For example:
b = group a by $0;
c = filter b by group != 'fred';
...
In this case, operations subsequent to the filter could also be
considered for pushing into the combiner.
b) If it operates on the bags with an algebraic function, then a
foreach with the initial function will be placed in the combiner
phase and the filter in the reduce phase will be changed to use the
final function. For example:
b = group a by $0;
c = filter b by COUNT(a) > 0;
...
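To make 1b concrete, the rewrite would conceptually look like this
(illustration only; COUNT.Initial and COUNT.Final stand in for the
algebraic pieces, and the real plans are physical operators rather than
Pig Latin):
b = group a by $0;
-- combiner:
b1 = foreach b generate group, COUNT.Initial(a);
-- reduce:
c = filter b1 by COUNT.Final($1) > 0;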
2) Foreach.
a) If the foreach does not contain a nested plan and all UDFs in the
generate statement are algebraic, then the foreach will be copied
and placed in the combiner phase. The version of the foreach in the
combiner stage will use the initial function, and the version in the
reduce stage will be changed to use the final function. For
example:
b = group a by $0;
c = foreach b generate group, group + 5, SUM(a.$1);
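Conceptually (again an illustration, not actual syntax), the copied
foreach would become:
-- combiner:
b1 = foreach b generate group, group + 5, SUM.Initial(a.$1);
-- reduce:
c = foreach b1 generate group, $1, SUM.Final($2);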
b) If the foreach has an inner plan that has a distinct and no
filters, then it will be left as is in the reduce plan, and a
combiner plan will be created that runs the inner plan, minus the
generate, on the tuples, thus creating the distinct portion of the
data without applying the UDF. For example:
b = group a by $0;
c = foreach b {
    c1 = distinct $1;
    generate group, COUNT(c1);
}
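Conceptually, the combiner plan for this example would run the inner plan
minus the generate (illustration only):
-- combiner:
b1 = foreach b {
    c1 = distinct $1;
    generate group, c1;
}
-- reduce: unchanged; the inner distinct and COUNT(c1) still run, but
-- over bags the combiner has already partially de-duplicated.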
3) Distinct. This will be converted to apply the distinct in the combiner
as well as in the reducer.
4) Limit. The limit will be applied in the combiner and again in the
reducer.
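For example (a sketch; the exact cutoff points are up to the
implementation):
b = group a by $0;
c = limit b 10;
-- each combiner invocation trims its input to 10 rows, cutting the
-- data shuffled to the reducers; the reducer applies the limit again
-- to produce the final 10. Distinct works analogously: duplicates are
-- dropped per combiner invocation and again in the reducer.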
Alan.