Currently the Algebraic interface allows a UDF writer to have an
Initial, Intermediate and Final class (each of which should implement
EvalFunc). The idea is that the UDF can be called in stages -
Initial.exec() in the map, Intermediate.exec() in the combiner and
Final.exec() in the Reduce. The UDF (say COUNT) which implements
Algebraic, also extends EvalFunc. This means that it has an exec()
method. Currently Pig calls this exec() method at the top level when the
UDF is not "combinable". When it is "combinable", Pig currently calls
Initial.exec() in the combine and Final.exec() in the Reduce. I will be
changing the "combinable" case to call Initial.exec() in the map,
Intermediate.exec() in the combine and Final.exec() in the reduce as
part of https://issues.apache.org/jira/browse/PIG-563. 


There are two options for the Non combinable case: 

1)       The way it is described above - top level UDF's exec() is
called when combiner is not used and if combiner is used, Initial.exec()
is called in the map, Intermediate.exec() in the combine and
Final.exec() in the reduce.

*         Pros: 

a.       Initial.exec() can be optimized with the knowledge that it is
only called in the map. For example, in UDFs like COUNT, since
Initial.exec() is always going to be called in map, the implementation
can be optimized to simply emit Integer 1.

*         Cons: 

a.       UDF writer has to potentially write two different code paths -
one where UDF.exec() computes the result completely in the reduce() and
another where Initial.exec(), Intermediate.exec() and Final.exec()
compute the result in parts in the map, combine and reduce respectively.


2)       If a UDF implements Algebraic, Pig will have to guarantee that
Initial.exec() will be called and later Final.exec() will be called. If
the UDF is combinable, these will be called from map and reduce
respectively and Intermediate.exec() will be called from the combine. If
the UDF is NOT combinable, Initial.exec() will be called first in the
reduce, then its output will be put in a bag and supplied to a call of
the Final.exec(). In both the cases the top level exec() of the UDF will
never be called.

*         Pros: 

a.       The guarantee that Initial.exec() and Final.exec() are called
in both combinable and non combinable cases. 

*         Cons: 

a.       The UDF writer has to give a dummy implementation for
UDF.exec() to satisfy the EvalFunc interface though UDF.exec() is never

b.       UDF writer should make sure the Initial.exec() and Final.exec()
work in both the combinable and non combinable cases. 

c.       There are performance penalties - in the combinable case, the
Initial.exec() cannot be optimized since there is no guarantee that it
is always called in the map. In the non combinable case, the call to
Initial.exec() will contain all input and hence the result can be
computed in that call itself. Despite this, Pig will have to take the
result of Initial.exec(), put it in a bag and call Final.exec() which
can be highly inefficient.


I would vote for option 1 since it is much better from a performance


Please provide Comments/Suggestions on the proposal.





Reply via email to