The following page has been changed by OlgaN:
http://wiki.apache.org/pig/EvalFunction

New page:

[[Anchor(Eval_Functions)]]
== Eval Functions ==

To create an eval function, extend the following abstract class. The type parameter T is the return type of the eval function.

{{{
public abstract class EvalFunc<T extends Datum> {

    public abstract void exec(Tuple input, T output) throws IOException;

}
}}}

[[Anchor(Input_to_the_Functions)]]
=== Input to the Functions ===

The arguments to the function are wrapped in a tuple and passed as the parameter `input` above. Thus, the first field of `input` is the first argument, the second field is the second argument, and so on.

For example, suppose I have a data set A =

{{{
<a, b, c>
<1, 2, 3>
}}}

Suppose I have written an eval function !MyFunc and my !PigLatin is as follows:

{{{
B = foreach A generate MyFunc($0,$2);
}}}

Then !MyFunc will be called first with the tuple <a, c> and then with the tuple <1, 3>.

[[Anchor(Output_of_the_functions)]]
=== Output of the functions ===

When extending the abstract class, the type parameter T must be bound to a subclass of Datum, not to Datum itself. (The compiler will allow you to subclass !EvalFunc<Datum>, but you will get an error when that function is used.)

When T is bound to a particular type of Datum (!DataAtom, Tuple, !DataBag, or !DataMap), the eval function is handed, through the parameter `output`, a Datum of type T in which to produce its output.

Note that when T is !DataBag, the bag you are handed as the parameter `output` is append-only: you can add tuples to it, but its contents always remain empty when read back. This is a performance optimization (we use it for pipelining), based on the assumption that you would not want to examine your own output.

[[Anchor(Example)]]
=== Example ===

As an example, here is the code for the builtin function TOKENIZE, which expects one argument of type data atom and tokenizes that string into a data bag of tuples, one tuple for each word in the input string.
{{{
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class TOKENIZE extends EvalFunc<DataBag> {

    @Override
    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument arrives as field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        // Split on space, double quote, comma, parentheses and asterisk.
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Emit one single-field tuple per token.
            output.add(new Tuple(tok.nextToken()));
        }
    }
}
}}}

[[Anchor(Advanced_Features)]]
=== Advanced Features ===

 * '''Schemas''': Eval functions can declare their output schema by overriding the following method in !EvalFunc. See: PigLatinSchemas.

 {{{
/**
 * @param input Schema of the input
 * @return Schema of the output
 */
public Schema outputSchema(Schema input) {
    return input.copy();
}
 }}}

 * '''Algebraic Eval Functions''': If the input to your function might be large (i.e. the input tuple may contain a large bag of tuples nested inside it) and you are concerned about performance, consider writing your function so that it can receive its input in small "chunks", one at a time, and then merge the per-chunk outputs to obtain the final output. (In the map/reduce model, the "combiner" feature does this.) To enable this feature, your eval function must implement the interface Algebraic. See AlgebraicEvalFunc for details.

 * '''Final cleanup action''': If your function needs to perform some final action after it has been called for the last time on a particular input set, it can override the finish method of the class !EvalFunc.

 {{{
/**
 * Placeholder for cleanup to be performed at the end. User defined functions can override.
 *
 */
public void finish(){}
 }}}
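
The chunk-and-merge idea behind algebraic eval functions can be illustrated with a small, self-contained sketch in plain Java. It is only an illustration of the principle, under the assumption that the function is a simple count: the class `ChunkedCount` and its methods `initial`, `merge` and `countInChunks` are hypothetical names invented here, and none of them is part of Pig or of its Algebraic interface (see AlgebraicEvalFunc for the real contract).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an algebraic COUNT. These names are
// illustrative only and are NOT Pig's Algebraic interface.
class ChunkedCount {

    // Per-chunk step: count the items in one chunk of the input.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Merge step: combine partial counts. Addition is associative and
    // commutative, so partial results can be merged in any order.
    static long merge(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Plays the role of the framework: split the input into chunks,
    // apply the per-chunk step, then merge the per-chunk outputs.
    static long countInChunks(List<String> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Long> partials = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            partials.add(initial(items.subList(i, end)));
        }
        return merge(partials);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c", "d", "e");
        // Chunks of sizes 2, 2 and 1 give partial counts 2, 2 and 1.
        System.out.println(countInChunks(words, 2)); // prints 5
    }
}
```

The result does not depend on the chunk size, which is the property that makes a function suitable for this treatment: the merge step must give the same answer however the input happens to be split.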