Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:

New page:
== Eval Functions ==
To create an eval function, the following abstract class must be extended. The 
parameter T is the return type of the eval function. 

public abstract class EvalFunc<T extends Datum>  {
    abstract public void exec(Tuple input, T output) throws IOException;
}

=== Input to the Functions ===
The arguments to the function get wrapped in a tuple and are passed as the 
parameter `input` above. Thus, the first field of `input` is the first argument 
and so on. 

For example, suppose I have a data set A = 
<a, b, c>
<1, 2, 3>

Suppose I have written an eval function !MyFunc and my !PigLatin is as follows:

B = foreach A generate MyFunc($0,$2);

Then !MyFunc will be called first with the tuple <a, c> and then with the tuple 
<1, 3>. 
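
The wrapping of arguments described above can be pictured with a small plain-Java model. This is only a sketch: it uses lists of strings in place of Pig's Tuple class, and the class and method names (`ArgumentWrappingSketch`, `wrapArguments`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of argument wrapping; these are NOT Pig's classes.
public class ArgumentWrappingSketch {

    // Builds the argument tuple for one input row by picking the given
    // zero-based field positions, as in MyFunc($0,$2).
    static List<String> wrapArguments(List<String> row, int... positions) {
        List<String> args = new ArrayList<>();
        for (int p : positions) {
            args.add(row.get(p));
        }
        return args;
    }

    public static void main(String[] argv) {
        List<List<String>> a = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("1", "2", "3"));
        for (List<String> row : a) {
            // The eval function would be called once per input tuple
            // with this wrapped argument list.
            System.out.println(wrapArguments(row, 0, 2));
        }
    }
}
```

Running the sketch prints `[a, c]` and then `[1, 3]`, matching the two calls to !MyFunc described above.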

=== Output of the functions ===

When extending the abstract class, the type parameter T must be bound to a 
subclass of Datum. (The compiler will allow you to subclass !EvalFunc<Datum>, 
but you will get an error when you use that function.) When T is bound to a 
particular type of Datum (!DataAtom, Tuple, !DataBag, or !DataMap), the eval 
function is handed, through the parameter `output`, a Datum of type T in which 
to produce its output. 

Note that when T is !DataBag, the bag you are handed as the parameter `output` 
is append-only. Its contents always remain empty, even after you add to it. 
This is a performance optimization (we use it for pipelining) based on the 
assumption that you wouldn't want to examine your own output.
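
One way to picture an append-only bag is a container that hands each added element straight to the next stage and keeps nothing for itself. The sketch below is an assumption about the general idea, not Pig's !DataBag implementation; `AppendOnlyBag` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an append-only bag; not Pig's DataBag.
public class AppendOnlyBagSketch {

    static class AppendOnlyBag<E> {
        private final Consumer<E> downstream;

        AppendOnlyBag(Consumer<E> downstream) {
            this.downstream = downstream;
        }

        // Each added element is handed to the next stage immediately.
        void add(E element) {
            downstream.accept(element);
        }

        // Nothing is retained, so the bag always looks empty to its writer.
        int size() {
            return 0;
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        AppendOnlyBag<String> output = new AppendOnlyBag<>(received::add);
        output.add("hello");
        output.add("world");
        System.out.println(output.size()); // what the writing function sees
        System.out.println(received);      // what the next stage received
    }
}
```

Because nothing is buffered, the writer cannot read back what it added, which is the trade-off described above.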

=== Example ===

As an example, here is the code for the builtin function TOKENIZE, which 
expects one argument of type data atom and tokenizes that string into a data 
bag of tuples, one for each word in the input string.

import java.io.IOException;
import java.util.StringTokenizer;

public class TOKENIZE extends EvalFunc<DataBag> {

    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument arrives as field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            output.add(new Tuple(tok.nextToken()));
        }
    }
}
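
To see what the delimiter string used by TOKENIZE does, the tokenizing step can be tried on its own with only the Java standard library. This is a sketch: the class name `TokenizeSketch` and the sample string are made up, and the words go into a plain list rather than a !DataBag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeSketch {

    // Splits on the same delimiter set as TOKENIZE: space, double quote,
    // comma, parentheses and asterisk. Delimiters are not returned.
    static List<String> tokenize(String str) {
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello, world (pig*latin)"));
    }
}
```

For the sample string this prints `[hello, world, pig, latin]`. Runs of adjacent delimiters produce no empty tokens.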

=== Advanced Features ===
   * '''Schemas''': Eval functions can declare their output schema by 
overriding the following method in !EvalFunc. See: PigLatinSchemas.

    /**
     * @param input Schema of the input
     * @return Schema of the output
     */
    public Schema outputSchema(Schema input) {
          return input.copy();
    }

   * '''Algebraic Eval Functions''': If the input to your function might be 
large (i.e. the input tuple may contain a large bag of tuples nested inside of 
it) and you are concerned about performance, you may want to consider writing 
your function in such a way that it can receive its input in small "chunks," 
one at a time, and then merge the per-chunk outputs to obtain the final output. 
(In the map/reduce model, the "combiner" feature does this.) To enable this 
feature, your eval function must implement the interface Algebraic. See 
AlgebraicEvalFunc for details.

   * '''Final cleanup action''': If your function needs to do some final action 
after being called the last time for a particular input set, it can override 
the finish method of the class !EvalFunc.
    /**
     * Placeholder for cleanup to be performed at the end. User defined 
     * functions can override.
     */
    public void finish(){}
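
The chunked evaluation described under Algebraic Eval Functions above can be illustrated with a count computed per chunk and then merged. This is a plain-Java sketch of the idea only; it does not use Pig's Algebraic interface, and the names `ChunkedCountSketch`, `countChunk` and `mergeCounts` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of chunked evaluation; it does not use Pig's
// Algebraic interface, and all names here are hypothetical.
public class ChunkedCountSketch {

    // Step 1: compute a partial result for one chunk of the input.
    static long countChunk(List<String> chunk) {
        return chunk.size();
    }

    // Step 2: merge the per-chunk results into the final result.
    static long mergeCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(countChunk(chunk));
        }
        System.out.println(mergeCounts(partials));
    }
}
```

The merged result equals the count over the whole input (5 here), which is the property a function needs for this approach to work: the final answer must be derivable from the per-chunk answers alone.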
