[Pig Wiki] Update of "EvalFunction" by OlgaN

Apache Wiki Wed, 07 Nov 2007 10:39:13 -0800

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/EvalFunction

New page:
[[Anchor(Eval_Functions)]]
== Eval Functions ==
To create an eval function, extend the following abstract class. The type 
parameter T is the return type of the eval function. 

{{{
public abstract class EvalFunc<T extends Datum>  {
    abstract public void exec(Tuple input, T output) throws IOException;
}
}}}

[[Anchor(Input_to_the_Functions)]]
=== Input to the Functions ===
The arguments to the function get wrapped in a tuple and are passed as the 
parameter `input` above. Thus, the first field of `input` is the first argument 
and so on. 

For example, suppose I have a data set A = 
{{{
<a, b, c>
<1, 2, 3>
}}}

Suppose I have written an eval function !MyFunc and my !PigLatin is as follows:

{{{
B = foreach A generate MyFunc($0,$2);
}}}

Then !MyFunc will be called first with the tuple <a, c> and then with the tuple 
<1, 3>. 
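
The argument wrapping described above can be imitated outside Pig. The following is only a minimal sketch: it models a tuple as a plain `List<String>`, and the class `ArgumentWrapping` and its `project` method are hypothetical names invented for illustration, not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not Pig's real classes: a tuple is modelled as a List<String>.
class ArgumentWrapping {
    // Build the `input` tuple for one row by picking the given field positions,
    // mirroring how MyFunc($0,$2) receives fields 0 and 2 of each tuple of A.
    static List<String> project(List<String> row, int... positions) {
        List<String> input = new ArrayList<>();
        for (int p : positions) {
            input.add(row.get(p));
        }
        return input;
    }

    public static void main(String[] args) {
        List<List<String>> a = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("1", "2", "3"));
        for (List<String> row : a) {
            // Prints [a, c] and then [1, 3], one input tuple per row of A.
            System.out.println(project(row, 0, 2));
        }
    }
}
```

The first field of each printed list corresponds to the first argument of the function, the second field to the second argument.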

[[Anchor(Output_of_the_functions)]]
=== Output of the functions ===

When extending the abstract class, the type parameter T must be bound to a 
specific subclass of Datum: !DataAtom, Tuple, !DataBag, or !DataMap. (The 
compiler will allow you to subclass !EvalFunc<Datum>, but you will get an 
error when you use that function.) The eval function is then handed, through 
the parameter `output`, a Datum of type T in which to produce its output. 

Note that when T is !DataBag, the bag you are handed as the parameter 
`output` is append-only: you can add tuples to it, but its contents always 
remain empty. This is a performance optimization (we use it for pipelining) 
based on the assumption that you wouldn't want to examine your own output.
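
One way to picture an append-only bag is as a thin wrapper that forwards each added element downstream instead of storing it. The sketch below rests on that assumption; `AppendOnlyBag` and `AppendOnlyBagDemo` are hypothetical classes written for illustration and do not show how Pig's !DataBag is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an append-only output bag; not Pig's DataBag.
class AppendOnlyBag<T> {
    private final Consumer<T> downstream;

    AppendOnlyBag(Consumer<T> downstream) {
        this.downstream = downstream;
    }

    // Each added element is forwarded immediately instead of being stored.
    void add(T element) {
        downstream.accept(element);
    }

    // Nothing is retained, so the bag always looks empty to the function.
    int size() {
        return 0;
    }
}

class AppendOnlyBagDemo {
    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        AppendOnlyBag<String> output = new AppendOnlyBag<>(received::add);
        output.add("x");
        output.add("y");
        System.out.println(output.size()); // prints 0
        System.out.println(received);      // prints [x, y]
    }
}
```

In this sketch the consumer stands in for the next stage of the pipeline, which sees every element as soon as it is added.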

[[Anchor(Example)]]
=== Example ===

As an example, here is the code for the builtin function TOKENIZE, which 
expects one argument of type data atom and tokenizes that string into a data 
bag of tuples, one for each word in the input string.

{{{
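import java.io.IOException;
import java.util.StringTokenizer;
// The Pig classes EvalFunc, Tuple, and DataBag must also be imported.

public class TOKENIZE extends EvalFunc<DataBag> {

    @Override
    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument is field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        // Split on space, double quote, comma, parentheses, and asterisk.
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Emit one single-field tuple per word.
            output.add(new Tuple(tok.nextToken()));
        }
    }
}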
}}}
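
To see what the delimiter set in the listing above does to a string, the tokenizing loop can be run on its own with only the JDK. This is a minimal sketch: `TokenizeSketch` is a hypothetical class that collects the words in a `List<String>` rather than a Pig data bag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical standalone version of the tokenizing loop; it uses no Pig classes.
class TokenizeSketch {
    // Same delimiter set as the TOKENIZE listing:
    // space, double quote, comma, parentheses, and asterisk.
    static List<String> tokenize(String str) {
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [hello, world, again]
        System.out.println(tokenize("hello, world (again)"));
    }
}
```

Because the third constructor argument is `false`, the delimiters themselves are not returned as tokens, and runs of adjacent delimiters produce no empty words.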


[[Anchor(Advanced_Features)]]
=== Advanced Features ===
   * '''Schemas''': Eval functions can declare their output schema by 
overriding the following method in !EvalFunc. See: PigLatinSchemas.

{{{
    /**
     * @param input Schema of the input
     * @return Schema of the output
     */
    public Schema outputSchema(Schema input)
    {
          return input.copy();
    }
}}}


   * '''Algebraic Eval Functions''': If the input to your function might be 
large (i.e. the input tuple may contain a large bag of tuples nested inside 
it) and you are concerned about performance, consider writing your function 
so that it can receive its input in small "chunks," one at a time, and then 
merge the per-chunk outputs to obtain the final output. (In the map/reduce 
model, the "combiner" feature does this.) To enable this feature, your eval 
function must implement the interface Algebraic. See AlgebraicEvalFunc for 
details.

   * '''Final cleanup action''': If your function needs to perform a final 
action after its last call for a particular input set, it can override the 
finish method of the class !EvalFunc.
{{{
    /**
     * Placeholder for cleanup to be performed at the end. User defined
     * functions can override.
     */
    public void finish(){}
}}}
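
The chunk-and-merge idea behind algebraic eval functions in the list above can be illustrated with a count. The sketch below is only an illustration of the principle: `ChunkedCount`, `countChunk`, and `merge` are hypothetical names, and they are not the methods of Pig's Algebraic interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these names are not Pig's Algebraic interface.
class ChunkedCount {
    // Per-chunk step: count the tuples in one small chunk of the input.
    static long countChunk(List<String> chunk) {
        return chunk.size();
    }

    // Merge step: add up the per-chunk counts to obtain the final output.
    static long merge(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> chunk1 = Arrays.asList("a", "b", "c");
        List<String> chunk2 = Arrays.asList("d", "e");
        long total = merge(Arrays.asList(countChunk(chunk1), countChunk(chunk2)));
        System.out.println(total); // prints 5
    }
}
```

Counting works this way because the count of the whole input equals the sum of the counts of its chunks; a function without such a property cannot be split into per-chunk and merge steps like this.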
