[Pig Wiki] Update of "UDFManual" by OlgaN

Apache Wiki Thu, 04 Dec 2008 14:53:23 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

New page:
=!! User Defined Function Guide =

Pig provides extensive support for user defined functions (UDFs) as a way to 
specify custom processing. Functions can be a part of almost every operator in 
Pig. This document describes how to use existing functions as well as how to 
write your own functions. 

*Note*: The infomation presented here is for the latest version of Pig, 
currently available on the `types` branch.

[[Anchor(Eval_Functions)]]
== Eval Functions ==

[[Anchor(How_to_Use_a_Simple_Eval_Function)]]
=== How to Use a Simple Eval Function ===

Eval is the most common type of function. It can be used in `FOREACH` 
statements as shown in this script:

{{{
-- this is myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
}}}

The command below can be used to run the script. Note that all examples in this 
document run in local mode for simplicity but the examples can also run in 
Hadoop mode. For more information on how to run Pig, please see the 
!PigTutorial.

{{{
java -cp pig.jar org.apache.pig.Main -x local myscript.pig
}}}

The first line of the script provides the location of the `jar file` that 
contains the UDF. (Note that there are no quotes around the jar file. Having 
quotes would result in a syntax error.) To locate the jar file, Pig first 
checks the `classpath`. If the jar file can't be found in the classpath, Pig 
assumes that the location is either an absolute path or a path relative to the 
location from which Pig was invoked. If the jar file can't be found, an error 
will be printed: `java.io.IOException: Can't read jar file: myudfs.jar`.

Multiple `register` commands can be used in the same script. If the same 
fully-qualified function is present in multiple jars, the first occurrence will 
be used consistently with Java semantics.

The name of the UDF has to be fully qualified with the package name or an error 
will be reported: `java.io.IOException: Cannot instantiate:UPPER`. Also, the 
function name is case sensitive (UPPER and upper are not the same). A UDF can 
take one or more parameters. The exact signature of the function should clear 
from its documentation.

The function provided in this example takes an ASCII string and produces its 
uppercase version. If you are familiar with column transformation functions in 
SQL, you will recognize that UPPER fits this concept. However, as we will see 
later in the document, eval functions in Pig go beyond column transformation 
functions and include aggregate and filter functions.

If you are just a user of UDFs, this is most of what you need to know about 
UDFs to use them in your code. 

[[Anchor(How_to_Write_a_Simple_Eval_Function)]]
=== How to Write a Simple Eval Function ===

Let's now look at the implementation of the `UPPER` UDF.
{{{
1. package myudfs;
2. import java.io.IOException;
3. import org.apache.pig.EvalFunc;
4. import org.apache.pig.data.Tuple;
5 import org.apache.pig.impl.util.WrappedIOException;

6. public class UPPER extends EvalFunc<String>
7. {
8.      public String exec(Tuple input) throws IOException {
9.          if (input `= null |||| input.size() =` 0)
10.              return null;
11.        try{
12.            String str = (String)input.get(0);
13.            return str.toUpperCase();
14.        }catch(Exception e){
15.            throw WrappedIOException.wrap("Caught exception processing input 
row ", e);
16.        }
17.     }
18. }
}}}

The first line indicates that the function is part of the `myudfs` package. The 
UDF class extends the `EvalFunc` class which is the base class for all eval 
functions. It is parameterized with the return type of the UDF which is a Java 
`String` in this case. We will look into the `EvalFunc` class in more detail 
later, but for now all we need to do is to implement the `exec` function. This 
function is invoked on every input tuple. The input into the function is a 
tuple with input parameters in the order they are passed to the function in the 
Pig script. In our example, it will contain a single string field corresponding 
to the student name.

The first thing to decide is what to do with invalid data. This depends on the 
format of the data. If the data is of type `bytearray` it means that it has not 
yet been converted to its proper type. In this case, if the format of the data 
does not match the expected type, a NULL value should be returned. If, on the 
other hand, the input data is of another type, this means that the conversion 
has already happened and the data should be in the correct format. This is the 
case with our example and that's why it throws an error (line 15.) Note that 
`WrappedIOException` is a helper class to convert the actual exception to an 
IOException. 

Also, note that lines 9-10 check if the input data is null or empty and if so 
returns null.

The actual function implementation is on lines 12-13 and is self-explanatory.

Now that we have the function implemented, it needs to be compiled and included 
in a jar. You will need a `pig.jar` built from the `types` branch to compile 
your UDF. You can use the following set of commands to checkout the code from 
SVN repository and create pig.jar:

{{{
svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types
cd types
ant
}}}

You should see `pig.jar` in your current working directory. The set of commands 
below first compiles the function and then creates a jar file that contains it.
{{{
cd myudfs
javac -cp pig.jar UPPER.java
cd ..
jar -cf myudfs.jar myudfs
}}}

You should now see `myudfs.jar` in your current working directory. You can use 
this jar with the script described in the previous section.

[[Anchor(Aggregate_Functions)]]
=== Aggregate Functions ===

Aggregate functions are another common type of eval function. Aggregate 
functions are usually applied to grouped data, as shown in this script:

{{{
-- this is myscript2.pig
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = GROUP A BY name;
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
}}}

The script above uses the `COUNT` function to count the number of students with 
the same name. There are a couple of things to note about this script. First, 
even though we are using a function, there is no `register` command. Second, 
the function is not qualified with the package name. The reason for both is 
that `COUNT` is a `builtin` function meaning that it comes with the Pig 
distribution. These are the only two differences between builtins and UDFs. 
Builtins are discussed in more detail later in this document. 

An aggregate function is an eval function that takes a bag and returns a scalar 
value. One interesting and useful property of many aggregate functions is that 
they can be computed incrementally in a distributed fashion. We call these 
functions `algebraic`. `COUNT` is an example of an algebraic function because 
we can count the number of elements in a subset of the data and then sum the 
counts to produce a final output. In the Hadoop world, this means that the 
partial computations can be done by the map and combiner, and the final result 
can be computed by the reducer.

It is very important for performance to make sure that aggregate functions that 
are algebraic are implemented as such. Let's look at the implementation of the 
COUNT function to see what this means. (Error handling and some other code is 
omitted to save space. The full code can be accessed 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup][here.)

{{{
public class COUNT extends EvalFunc<Long> implements Algebraic{
    public Long exec(Tuple input) throws IOException {return count(input);}
    public String getInitial() {return Initial.class.getName();}
    public String getIntermed() {return Intermed.class.getName();}
    public String getFinal() {return Final.class.getName();}
    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {return 
TupleFactory.getInstance().newTuple(count(input));}
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {return 
TupleFactory.getInstance().newTuple(sum(input));}
    }
    static public class Final extends EvalFunc<Long> {
        public Tuple exec(Tuple input) throws IOException {return sum(input);}
    }
    static protected Long count(Tuple input) throws ExecException {
        Object values = input.get(0);
        if (values instanceof DataBag) return ((DataBag)values).size();
        else if (values instanceof Map) return new Long(((Map)values).size());
    }
    static protected Long sum(Tuple input) throws ExecException, 
NumberFormatException {
        DataBag values = (DataBag)input.get(0);
        long sum = 0;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            sum += (Long)t.get(0);
        }
        return sum;
    }
}
}}}
`COUNT` implements `Algebraic` interface which looks like this:
{{{
public interface Algebraic{
    public String getInitial();
    public String getIntermed();
    public String getFinal();
}
}}}

For a function to be algebraic, it needs to implement `Algebraic` interface 
that consist of definition of three classes derived from `EvalFunc`. The 
contract is that the  `exec` function of the `Initial` class is called once and 
is passed the original input tuple. Its output is a tuple that contains partial 
results. The `exec` function of the `Intermed` class can be called zero or more 
times and takes as its input a tuple that contains partial results produced by 
the `Initial` class or by prior invocations of the `Intermed` class and 
produces a tuple with another partial result. Finally, the `exec` function of 
the `Final` class is called and produces the final result as a scalar type.

Here's the way to think about this in the Hadoop world. The `exec` function of 
the `Initial` class is invoked once by the `map` process and produces partial 
results. The `exec` function of the `Intermed` class is invoked once by each 
`combiner` invocation (which can happen zero or more times) and also produces 
partial results. The `exec` function of the `Final` class is invoked once by 
the reducer and produces the final result.

Take a look at the `COUNT` implementation to see how this is done. Note that 
the `exec` function of the `Initial` and `Intermed` classes is parameterized 
with `Tuple` and the `exec` of the `Final` class is parameterized with the real 
type of the function, which in the case of the `COUNT` is `Long`. Also, note 
that the fully-qualified name of the class needs to be returned from 
`getInitial`, `getIntermed`, and `getFinal` methods.

[[Anchor(Filter_Functions)]]
=== Filter Functions ===

Filter functions are eval functions that return a `boolean` value. Filter 
functions can be used anywhere a Boolean expression is appropriate, including 
the `FILTER` operator or `bincond` expression.

The example below uses the `IsEmpy` builtin filter function to implement joins.

{{{
-- inner join
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararay, 
contributions: float);
C = COGROUP A BY name, B BY name;
D = FILTER C BY not IsEmpty(A);
E = FILTER D BY not IsEmpty(B);
F = FOREACH E GENERATE flatten(A), flatten(B);
DUMP F;
}}}

Note that, even if filtering is omitted, the same results will be produced 
because the `foreach` results is a cross product and cross products get rid of 
empty bags. However, doing up-front filtering is more efficient since it 
reduces the input of the cross product.

{{{
-- full outer join
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararay, 
contributions: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), 
flatten((IsEmpty(B) ? null : B));
dump D;
}}}

The implementation of the `IsEmpty` function looks like this:
{{{
import java.io.IOException;
import java.util.Map;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.util.WrappedIOException;

public class IsEmpty extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        if (input `= null |||| input.size() =` 0)
            return null;
        try {
            Object values = input.get(0);
            if (values instanceof DataBag)
                return ((DataBag)values).size() == 0;
            else if (values instanceof Map)
                return ((Map)values).size() == 0;
            else{
                throw new IOException("Cannot test a " +
                    DataType.findTypeName(values) + " for emptiness.");
            }
        } catch (ExecException ee) {
            throw WrappedIOException.wrap("Caught exception processing input 
row ", ee);   
        }
    }
}
}}}
[[Anchor(Pig_Types)]]
=== Pig Types ===

The main thing to know about Pig's type system is that Pig uses native Java 
types for almost all of its types, as shown in this table.

|| Pig Type || Java Class ||
|| bytearray || !DataByteArray ||
|| chararray || String ||
|| int || Integer ||
|| long || Long ||
|| float || Float ||
|| double || Double ||
|| tuple || Tuple ||
|| bag || !DataBag ||
|| map || Map<Object, Object> ||

All Pig-specific classes are available 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/][here.
 

`Tuple` and `DataBag` are different in that they are not concrete classes but 
rather interfaces. This enables users to extend Pig with their own versions of 
tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; 
they need to go through factory classes: `TupleFactory` and `BagFactory`.

The builtin `TOKENIZE` function shows how bags and tuples are created. A 
function takes a text string as input and returns a bag of words from the text. 
(Note that currently Pig bags always contain tuples.)

{{{
1.  package org.apache.pig.builtin;

2.  import java.io.IOException;
3.  import java.util.StringTokenizer;
4.  import org.apache.pig.EvalFunc;
5.  import org.apache.pig.data.BagFactory;
6.  import org.apache.pig.data.DataBag;
7.  import org.apache.pig.data.Tuple;
8.  import org.apache.pig.data.TupleFactory;

9.  public class TOKENIZE extends EvalFunc<DataBag> {
10.TupleFactory mTupleFactory = TupleFactory.getInstance();
11.BagFactory mBagFactory = BagFactory.getInstance();

12.   public DataBag exec(Tuple input) throws IOException {
13.        try {
14.            DataBag output = mBagFactory.newDefaultBag();
15.            Object o = input.get(0);
16.            if (!(o instanceof String)) {
17.                throw new IOException("Expected input to be chararray, but  
got " + o.getClass().getName());
18.            }
19.            StringTokenizer tok = new StringTokenizer((String)o, " \",()*", 
false);
20.            while (tok.hasMoreTokens()) 
output.add(mTupleFactory.newTuple(tok.nextToken()));
21.            return output;
22.        } catch (ExecException ee) {
23.            // error handling goes here
24.        }
25.    }
26. }
}}}

Lines 10 and 11 create tuple and bag factories respectively. (Factory is a 
class that creates objects of a particular type. For more details, see the 
definition of http://en.wikipedia.org/wiki/Factory_pattern][a factory pattern. 
The factory class itself is implemented as a singleton to guarantee that the 
same factory is used everywhere. For more details see the definition of 
http://en.wikipedia.org/wiki/Singleton_pattern][a singleton pattern.)

Line 14 creates a bag using the factory that will contain the output of the 
function. Line 20 creates a tuple for each token and adds it to the output bag.

[[Anchor(Schema)]]
=== Schema ===

The latest version of Pig uses type information for validation and performance. 
It is important for UDFs to participate in type propagation. Until now, our 
UDFs made no effort to communicate their output schema to Pig. This is because, 
most of the time, Pig can figure out this information by using Java's 
http://java.sun.com/developer/technicalArticles/ALT/Reflection/][Reflection. If 
your UDF returns a scalar or a map, no work is required. However, if your UDF 
returns a `tuple` or a `bag` (of tuples), it needs to help Pig figure out the 
structure of the tuple.

If a UDF returns a `tuple` or a `bag` and schema information is not provided, 
Pig assumes that the tuple contains a single field of type `bytearray`. If this 
is not the case, then not specifying the schema can cause failures. We look at 
this next.

Let's assume that we have UDF `Swap` that, given a tuple with two fields, swaps 
their order. Let's assume that the UDF does not specify a schema and look at 
the scripts below:

{{{
1. register myudfs.jar;
2. A = load 'student_data' as (name: chararray, age: int, gpa: float);
3. B = foreach A generate flatten(myudfs.Swap(name, age)), gpa;
4. C = foreach B generate $2;
5. D = limit B 20;
6. dump D;
}}}

This script will result in the following error cause by line 4.
{{{
java.io.IOException: Out of bound access. Trying to access non-existent column: 
2. Schema {bytearray,gpa: float} has 2 column(s).
}}}

This is because Pig is only aware of two columns in B while line 4 is 
requesting the third column of the tuple. (Column indexing in Pig starts with 
0.)

The function, including the schema, looks like this:

{{{
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class Swap extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        if (input == null |||| input.size() < 2)
            return null;
        try{
            Tuple output = TupleFactory.getInstance().newTuple(2);
            output.set(0, input.get(1));
            output.set(1, input.get(0));
            return output;
        } catch(Exception e){
            System.err.println("Failed to process input; error - " + 
e.getMessage());
            return null;
        }
    }
    public Schema outputSchema(Schema input) {
        try{
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(1));
            tupleSchema.add(input.getField(0));
            return new Schema(new 
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
input),
                                                                              
tupleSchema, DataType.TUPLE));
        }catch (Exception e){
                return null;
        }
    }
}
}}}

The function creates a schema with a single field (of type `FieldSchema=) of 
type =tuple`. The name of the field is constructed using the `getSchemaName` 
function of the `EvalFunc` class. The name consists of the name of the UDF 
function, the first parameter passed to it, and a sequence number to guarantee 
uniqueness. In the previous script, if you replace `dump D;`  with `describe 
B;` , you will see the following output:
{{{
B: {myudfs.swap_age_3::age: int,myudfs.swap_age_3::name: chararray,gpa: float}
}}}

The second parameter to the `FieldSchema` constructor is the schema 
representing this field, which in this case is a tuple with two fields. The 
third parameter represents the type of the schema, which in this case is a 
`TUPLE`. All supported schema types are defined in the 
`org.apache.pig.data.DataType` class.

{{{
public class DataType {
    public static final byte UNKNOWN   =   0;
    public static final byte NULL      =   1;
    public static final byte BOOLEAN   =   5; // internal use only
    public static final byte BYTE      =   6; // internal use only
    public static final byte INTEGER   =  10;
    public static final byte LONG      =  15;
    public static final byte FLOAT     =  20;
    public static final byte DOUBLE    =  25;
    public static final byte BYTEARRAY =  50;
    public static final byte CHARARRAY =  55;
    public static final byte MAP       = 100;
    public static final byte TUPLE     = 110;
    public static final byte BAG       = 120;
    public static final byte ERROR     =  -1;
    // more code here
}
}}}

You need to import the `org.apache.pig.data.DataType` class into your code to 
define schemas. You also need to import the schema class 
`org.apache.pig.impl.logicalLayer.schema.Schema`.

The example above shows how to create an output schema for a tuple. Doing this 
for a bag is very similar. Let's extend the `TOKENIZE` function to do that:

{{{
package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but  got 
" + o.getClass().getName());
            }
            StringTokenizer tok = new StringTokenizer((String)o, " \",()*", 
false);
            while (tok.hasMoreTokens()) 
output.add(mTupleFactory.newTuple(tok.nextToken()));
            return output;
        } catch (ExecException ee) {
            // error handling goes here
        }
    }
    public Schema outputSchema(Schema input) {
         try{
             Schema bagSchema = new Schema();
             bagSchema.add(new Schema.FieldSchema("token", DataType.CHARARRAY));

             return new Schema(new 
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
input),
                                                    bagSchema, DataType.BAG));
         }catch (Exception e){
            return null;
         }
    }
}
}}}

As you can see, this is very similar to the output schema definition in the 
`Swap` function. One difference is that instead of reusing input schema, we 
create a brand new field schema to represent the tokens stored in the bag. The 
other difference is that the type of the schema created is `BAG` (not =TUPLE=).

[[Anchor(Error_Handling)]]
=== Error Handling ===

There are several types of errors that can occur in a UDF:

   1.. An error that affects a particular row but is not likely to impact other 
rows. An example of such an error would be a malformed input value or divide by 
zero problem. A reasonable handling of this situation would be to emit a 
warning and return a null value. `ABS` function in the next section 
demonstrates this approach. The current approach is to write the warning to 
`stderr`. Eventually we would like to pass a logger to the UDFs. Note that 
returning a NULL value only makes sense if the malformed value is of type 
`bytearray`. Otherwise the proper type has been already created and should have 
an appropriate value. If this is not the case, it is an internal error and 
should cause the system to fail. Both cases can be seen in the implementation 
of the `ABS` function in the next section.
   2.. An error that affects the entire processing but can succeed on retry. An 
example of such a failure is the inability to open a lookup file because the 
file could not be found. This could be a temporary environmental issue that can 
go away on retry. A UDF can signal this to Pig by throwing an `IOException` as 
with the case of the `ABS` function below.
   3.. An error that affects the entire processing and is not likely to succeed 
on retry. An example of such a failure is the inability to open a lookup file 
because of file permission problems. Pig currently does not have a way to 
handle this case. Hadoop does not have a way to handle this case either. It 
will be handled the same way as 2 above.

Pig provides a helper class `WrappedIOException`. The intent here is to allow 
you to convert any exception into `IOException`. Its usage can be seen in the 
`UPPER` function in our first example.

[[Anchor(Function_Overloading)]]
=== Function Overloading ===

Before the type system was available in Pig, all values for the purpose of 
arithmetic calculations were assumed to be doubles as the safest choice. 
However, this is not very efficient if the data is actually of type integer or 
long. (We saw about a 2x slowdown of a query when using double where integer 
could be used.) Now that Pig supports types we can take advantage of the type 
information and choose the function that is most efficient for the provided 
operands.

UDF writers are encouraged to provide type-specific versions of a function if 
this can result in better performance. On the other hand, we don't want the 
users of the functions to worry about different functions - the right thing 
should just happen. Pig allows for this via a function table mechanism as shown 
in the next example.

This example shows the implementation of the `ABS` function that returns the 
absolute value of a numeric value passed to it as input.

{{{
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;
import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.util.WrappedIOException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class ABS extends EvalFunc<Double>{
    public Double exec(Tuple input) throws IOException {
        if (input `= null |||| input.size() =` 0)
            return null;
        Double d;
        try{
            d = DataType.toDouble(input.get(0));
        } catch (NumberFormatException nfe){
            System.err.println("Failed to process input; error - " + 
nfe.getMessage());
            return null;
        } catch (Exception e){
            throw WrappedIOException.wrap("Caught exception processing input 
row ", e);
        }
        return Math.abs(d);
    }
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new 
Schema.FieldSchema(null, DataType.BYTEARRAY))));
        funcList.add(new FuncSpec(DoubleAbs.class.getName(),  new Schema(new 
Schema.FieldSchema(null, DataType.DOUBLE))));
        funcList.add(new FuncSpec(FloatAbs.class.getName(),   new Schema(new 
Schema.FieldSchema(null, DataType.FLOAT))));
        funcList.add(new FuncSpec(IntAbs.class.getName(),  new Schema(new 
Schema.FieldSchema(null, DataType.INTEGER))));
        funcList.add(new FuncSpec(LongAbs.class.getName(),  new Schema(new 
Schema.FieldSchema(null, DataType.LONG))));
        return funcList;
    }
}
}}}

The main thing to notice in this example is the `getArgToFuncMapping()` method. 
This method returns a list that contains a mapping from the input schema to the 
class that should be used to handle it. In this example the main class handles 
the `bytearray` input and outsources the rest of the work to other classes 
implemented in separate files in the same package. The example of one such 
class is below. This class handles integer input values.

{{{
import java.io.IOException;
import org.apache.pig.impl.util.WrappedIOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IntAbs extends EvalFunc<Integer>{
    public Integer exec(Tuple input) throws IOException {
        if (input `= null |||| input.size() =` 0)
            return null;
        Integer d;
        try{
            d = (Integer)input.get(0);
        } catch (Exception e){
            throw WrappedIOException.wrap("Caught exception processing input 
row ", e);
        }
        return Math.abs(d);
    }
}
}}}

A note on error handling. The `ABS` class covers the case of the `bytearray` 
which means the data has not been converted yet to its actual type. This is why 
a null value is returned when `NumberFormatException` is encountered. However, 
the `IntAbs` function is only called if the data is already of type `Integer` 
which means it has already been converted to the real type and bad format has 
been dealt with. This is why an exception is thrown if the input can't be cast 
to `Integer`.

The example above covers a reasonably simple case where the UDF only takes one 
parameter and there is a separate function for each parameter type. However, 
this will not always be the case. If Pig can't find an `exact match` it tries 
to do a `best match`. The rule for the best match is to find the most efficient 
function that can be used safely. This means that Pig must find the function 
that, for each input parameter, provides the smallest type that is equal to or 
greater than the input type. The type progression rules are: 
`int=->=long=->=float=->=double`.

For instance, let's consider function `MAX` which is part of the `piggybank` 
described later in this document. Given two values, the function returns the 
larger value. The function table for `MAX` looks like this:

{{{
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcList = new ArrayList<FuncSpec>();
    Util.addToFunctionList(funcList, IntMax.class.getName(), DataType.INTEGER);
    Util.addToFunctionList(funcList, DoubleMax.class.getName(), 
DataType.DOUBLE);
    Util.addToFunctionList(funcList, FloatMax.class.getName(), DataType.FLOAT);
    Util.addToFunctionList(funcList, LongMax.class.getName(), DataType.LONG);

    return funcList;
}
}}}

The `Util.addToFunctionList` function is a helper function that adds an entry 
to the list as the first argument, with the key of the class name passed as the 
second argument, and the schema containing two fields of the same type as the 
third argument. 

Let's now see how this function can be used in a Pig script:

{{{
REGISTER piggybank.jar
A = LOAD 'student_data' AS (name: chararray, gpa1: float, gpa2: double);
B = FOREACH A GENERATE name, org.apache.pig.piggybank.evaluation.math.MAX(gpa1, 
gpa2);
DUMP B;
}}}

In this example, the function gets one parameter of type `float` and another of 
type `double`. The best fit will be the function that takes two double values. 
Pig makes this choice on the user's behalf by inserting implicit casts for the 
parameters. Running the script above is equivalent to running the script below:

{{{
A = LOAD 'student_data' AS (name: chararray, gpa1: float, gpa2: double);
B = FOREACH A GENERATE name, 
org.apache.pig.piggybank.evaluation.math.MAX((double)gpa1, gpa2);
DUMP B;
}}}

A special case of the `best fit` approach is handling data without a schema 
specified. The type for this data is interpreted as `bytearray`. Since the type 
of the data is not known, there is no way to choose a best fit version. The 
only time a cast is performed is when the function table contains only a single 
entry. This works well to maintain backward compatibility. 

Let's revisit the `UPPER` function from our first example. As it is written 
now, it would only work if the data passed to it is of type `chararray`. To 
make it work with data whose type is not explicitly set, a function table with 
a single entry needs to be added:

{{{
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input `= null |||| input.size() =` 0)
            return null;
        try{
            String str = (String)input.get(0);
            return str.toUpperCase();
        }catch(Exception e){
            System.err.println("WARN: UPPER: failed to process input; error - " 
+ e.getMessage());
            return null;
        }
    }
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new 
Schema.FieldSchema(null, DataType.CHARARRAY))));
        return funcList;
    }
}
}}}

Now the following script will ran:

{{{
-- this is myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name, age, gpa);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
}}}

[[Anchor(Reporting_Progress)]]
=== Reporting Progress ===

A challenge of running a large shared system is to make sure system resources 
are used efficiently. One aspect of this challenge is detecting runaway 
processes that are no longer making progress. Pig uses a heartbeat mechanism 
for this purpose. If any of the tasks stops sending a heartbeat, the system 
assumes that it is dead and kills it.  

Most of the time, single-tuple processing within a UDF is very short and does 
not require a UDF to heartbeat. The same is true for aggregate functions that 
operate on large bags because bag iteration code takes care of it. However, if 
you have a function that performs a complex computation that can take an order 
of minutes to execute, you should add a progress indicator to your code. This 
is very easy to accomplish. The `EvalFunc` function provides a `progress` 
function that you need to call in your `exec` method.

For instance, the `UPPER` function would now look as follows:

{{{
public class UPPER extends EvalFunc<String>
{
        public String exec(Tuple input) throws IOException {
                if (input `= null |||| input.size() =` 0)
                return null;
                try{
                        reporter.progress();
                        String str = (String)input.get(0);
                        return str.toUpperCase();
                }catch(Exception e){
                    throw WrappedIOException.wrap("Caught exception processing 
input row ", e);
                }
        }
}
}}}

[[Anchor(Load/Store_Functions)]]
== Load/Store Functions ==

These user defined functions control how data goes into Pig and comes out of 
Pig. Often, the same function handles both input and output but that does not 
have to be the case.

[[Anchor(Load_Functions)]]
=== Load Functions ===
Every load function needs to implement the `LoadFunc` interface. An abbreviated 
version is shown below. The full definition can be seen 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup][here.

{{{
public interface LoadFunc {
    public void bindTo(String fileName, BufferedPositionedInputStream is, long 
offset, long end) throws IOException;
    public Tuple getNext() throws IOException;
    // conversion functions
    public Integer bytesToInteger(byte[] b) throws IOException;
    public Long bytesToLong(byte[] b) throws IOException;
    ......
    public void fieldsToRead(Schema schema);
    public Schema determineSchema(String fileName, ExecType execType, 
DataStorage storage) throws IOException;
}}}

The `bindTo` function is called once by each Pig task before it starts 
processing data. It is intended to connect the function to its input. It 
provides the following information:
   * `fileName` - The name of the file from which the data is read. Not used 
most of the time
   * `is` - The input stream from which the data is read. It is already 
positioned at the place where the function needs to start reading
   * `offset` - The offset into the stream from which to read. It is equivalent 
to `is.getPosition()` and not strictly needed
   * `end` - The position of the last byte that should be read by the function.
In the Hadoop world, the input data is treated as a continuous stream of bytes. 
A `slicer`, discussed in the Advanced Topics section, is used to split the data 
into chunks with each chunk going to a particular task for processing. This 
chunk is what `bindTo` provides to the UDF. Note that unless you use a custom 
slicer, the default slicer is not aware of tuple boundaries. This means that 
the chunk you get can start and end in the middle of a particular tuple. One 
common approach is to skip the first partial tuple and continue past the end 
position to finish processing a tuple. This is what `PigStorage` does as the 
example later in this section shows.

The `getNext` function reads the input stream and constructs the next tuple. It 
returns `null` when it is done with processing and throws an `IOException` if 
it fails to process an input tuple.

Next is a bunch of conversion routines that convert data from `bytearray` to 
the requested type. This requires further explanation. By default, we would 
like the loader to do as little per-tuple processing as possible. This is 
because many tuples can be thrown out during filtering or joins. Also, many 
fields might not get used because they get projected out. If the data needs to 
be converted into another form, we would like this conversion to happen as late 
as possible. The majority of the loaders should return the data as bytearrays 
and the Pig will request a conversion from bytearray to the actual type when 
needed. Let's looks at the example below:

{{{
A = load 'student_data' using PigStorage() as (name: chararray, age: int, gpa: 
float);
B = filter A by age >25;
C = foreach B generate name;
dump C;
}}}

In this query, only `age` needs to be converted to its actual type (=int=) 
right away. `name` only needs to be converted in the next step of processing 
where the data is likely to be much smaller. `gpa` is not used at all and will 
never need to be converted.

This is the main reason for Pig to separate the reading of the data (which can 
happen immediately) from the converting of the data (to the right type, which 
can happen later). For ASCII data, Pig provides `Utf8StorageConverter` that 
your loader class can extend and will take care of all the conversion routines. 
The code for it can be found 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup.][here.

Note that conversion rutines should return null values for data that can't be 
converted to the specified type.

Loaders that work with binary data like `BinStorage` are not going to use this 
model. Instead, they will produce objects of the appropriate types. However, 
they might still need to define conversion routines in case some of the fields 
in a tuple are of type `bytearray`.

`fieldsToRead` is reserved for future use and should be left empty.

The `determineSchema` function must be implemented by loaders that return real 
data types rather than `bytearray` fields. Other loaders should just return 
`null`. The idea here is that Pig needs to know the actual types it will be 
getting; Pig will call `determineSchema` on the client side to get this 
information. The function is provided as a way to sample the data to determine 
its schema. 

Here is the example of the function implemented by =BinStorage=:

{{{
public Schema determineSchema(String fileName, ExecType execType, DataStorage 
storage) throws IOException {
    InputStream is = FileLocalizer.open(fileName, execType, storage);
    bindTo(fileName, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
        // get the first record from the input file and figure out the schema 
        Tuple t = getNext();
        if(t == null) return null;
        int numFields = t.size();
        Schema s = new Schema();
        for (int i = 0; i < numFields; i++) {
            try {
                s.add(DataType.determineFieldSchema(t.get(i)));
            } catch (Exception e) {
                throw WrappedIOException.wrap(e);
            }
        }
        return s;
    }
}}}

Note that this approach assumes that the data has a uniform schema. The 
function needs to make sure that the data it produces conforms to the schema 
returned by `determineSchema`, otherwise the processing will fail. This means 
producing the right number of fields in the tuple (dropping fields or emitting 
null values if needed) and producing fields of the right type (again emitting 
null values as needed).

For complete examples, see 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup][BinStroage
 and 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup][PigStorage.

[[Anchor(Store_Functions)]]
=== Store Functions ===

All store functions need to implement the `StoreFunc` interface:

{{{
public interface StoreFunc {
    public abstract void bindTo(OutputStream os) throws IOException;
    public abstract void putNext(Tuple f) throws IOException;
    public abstract void finish() throws IOException;
}
}}}

The `bindTo` method is called in the beginning of the processing to connect the 
store function to the output stream it will write to. The `putNext` method is 
called for every tuple to be stored and is responsible for writing the tuple 
into the output. The `finish` function is called at the end of the processing 
to do all needed cleanup like flushing the output stream.

Here is an example of a simple store function that writes data as a string 
returned from the `toString` function.

{{{
public class StringStore implements StoreFunc {
    OutputStream os;
    private byte recordDel = (byte)'\n';
    public void bindTo(OutputStream os) throws IOException
    {
        this.os = os;
    }
    public void putNext(Tuple t) throws IOException
    {
        os.write((t.toString() + (char)this.recordDel).getBytes("utf8"));
    }
    public void finish() throws IOException
    {
         os.flush();
    }
}
}}}

[[Anchor(Comparison_Functions)]]
== Comparison Functions ==

Comparison UDFs are mostly obsolete now. They were added to the language 
because, at that time, the `ORDER` operator had two significant shortcomings. 
First, it did not allow descending order and, second, it only supported 
alphanumeric order.

The latest version of Pig solves both of these issues. The 
http://wiki.apache.org/pig/UserDefinedOrdering][pointer to the original 
documentation is provided here for completeness.

[[Anchor(Builtin_Functions_and_Function_Repositories)]]
== Builtin Functions and Function Repositories ==

Pig comes with a set of builtin in functions. (NEED LINK) Two main properties 
differentiate builtin functions from UDFs. First, they don't need to be 
registered because Pig knows where they are. Second, they don't need to be 
qualified when used because Pig knows where to find them.

In addition to builtins, Pig hosts a UDF repository called `piggybank` that 
allows users to share UDFs that they have written. The details are described in 
http://wiki.apache.org/pig/PiggyBank][PiggyBank.

[[Anchor(Advanced_Topics)]]
== Advanced Topics ==

[[Anchor(Function_Instantiation)]]
=== Function Instantiation ===

One problem that users run into is when they make assumption about how many 
times a constructor for their UDF is called. For instance, they might be 
creating side files in the store function and doing it in the constructor seems 
like a good idea. The problem with this approach is that in most cases Pig 
instantiates functions on the client side to, for instance, examine the schema 
of the data. 

Users should not make assumptions about how many times a function is 
instantiated; instead, they should make their code resilient to multiple 
instantiations. For instance, they could check if the files exist before 
creating them.

[PASSING PARAMETERS]

[[Anchor(Schemas)]]
=== Schemas ===

One request from users is to have the ability to examine the input schema of 
the data before processing the data. For example, they would like to know how 
to convert an input tuple to a map such that the keys in the map are the names 
of the input columns. The current answer is that there is now way to do this. 
This is something we would like to support in the future.

[schema overriding rules]
[[Anchor(Custom_Slicer)]]
=== Custom Slicer ===

Sometimes a `LoadFunc` needs more control over how input is chopped up or even 
found.

Here are some scenarios that call for a custom slicer:
   * Input needs to be chopped up differently than on block boundaries. 
(Perhaps you want every 1M instead of every 128M. Or, you may want to process 
in big 1G chunks.)
   * Input comes from a source outside of HDFS. (Perhaps you are reading from a 
database.)
   * There are locality preferences for processing the data that is more than 
simple HDFS locality.
   * Extra information needs to be passed from the client machine to the 
`LoadFunc` instances running remotely.

All of these scenarios are addressed by slicers. There are two parts to the 
slicing framework: `Slicer`, the class that creates slices, and `Slice`, the 
class that represents a particular piece of the input. Slicing kicks in when 
Pig sees that the `LoadFunc` implements the `Slicer` interface.

[[Anchor(Slicer)]]
==== Slicer ====

The slicer has two basic functions: validate input and slice up the input. Both 
of these methods will be called on the client machine. 

{{{
public interface Slicer {
    void validate(DataStorage store, String location) throws IOException;
   Slice[] slice(DataStorage store, String location) throws IOException;
}
}}}

The implementer of `Slicer` is responsible for checking that the input 
specified is valid and if not, throwing an `IOException`. The implementor is 
free to use the parameters of the `validate` and `slice` methods in anyway they 
see fit. The `store` parameter will be the current `DataStorage` object in 
effect for the current instance of `PigServer` and `location` will be the 
string passed to the `LoadFunc`. Once the input has been validated, the 
`Slicer` will be ask to chop up the file into `Slice` s. `Slicer` addresses the 
needs of the first two scenarios.

[[Anchor(Slice)]]
==== Slice ====

Each slice describes a unit of work and will correspond to a map task in Hadoop.

{{{
public interface Slice extends Serializable {
    String[] getLocations();
    void init(DataStorage store) throws IOException;
    long getStart();
    long getLength();
    void close() throws IOException;
    long getPos() throws IOException;
    float getProgress() throws IOException;
    boolean next(Tuple value) throws IOException;
}
}}}

Only one of the methods is used for scheduling: `getLocations()`. This method 
allows the implementor to give hints to Pig about where the task should be run. 
It is only a hint. If things are busy, the task may get scheduled elsewhere.

The rest of the `Slice` methods are used to read records on the processing 
nodes. `init` is called right after the `Slice` object is deserialized and 
`close` is called after the last record has been read. The Pig runtime will 
read records from the `Slice` until `getPos()` exceeds `getLength()`. Because 
`Slice` implements serializable, `Slicer` can encode information in the `Slice` 
that will later be available when the task is run.

[[Anchor(Example)]]
==== Example ====

This example shows a simple `Slicer` that gets a count from the input stream 
and generates that number of `Slice` s. 

{{{
public class RangeSlicer implements Slicer {
    /**
     * Expects location to be a Stringified integer, and makes
     * Integer.parseInt(location) slices. Each slice generates a single value,
     * its index in the sequence of slices.
     */
    public Slice[] slice (DataStorage store, String location) throws 
IOException {
        // Note: validate has already made sure that location is an integer
        int numslices = Integer.parseInt(location);
        Slice[] slices = new Slice[numslices];
        for (int i = 0; i < slices.length; i++) {
            slices[i] = new SingleValueSlice(i);
        }
        return slices;
    }
    public void validate(DataStorage store, String location) throws IOException 
{
        try {
            Integer.parseInt(location);
        } catch (NumberFormatException nfe) {
            throw new IOException(nfe.getMessage());
        }
    }
    /**
     * A Slice that returns a single value from next.
     */
    public static class SingleValueSlice implements Slice {
        // note this value is set by the Slicer and will get serialized and 
deserialized at the remote processing node
        public int val;
        // since we just have a single value, we can use a boolean rather than 
a counter
        private transient boolean read;
        public SingleValueSlice (int value) {
            this.val = value;
        }
        public void close () throws IOException {}
        public long getLength () { return 1; }
        public String[] getLocations () { return new String[0]; }
        public long getStart() { return 0; }
        public long getPos () throws IOException { return read ? 1 : 0; }
        public float getProgress () throws IOException { return read ? 1 : 0; }
        public void init (DataStorage store) throws IOException {}
        public boolean next (Tuple value) throws IOException {
            if (!read) {
                value.appendField(new DataAtom(val));
                read = true;
                return true;
            }
            return false;
        }
        private static final long serialVersionUID = 1L;
    }
}
}}}

You can invoke the `RangeSlicer` class with the following Pig Latin statement:

{{{
LOAD '27' USING RangeSlicer();
}}}
[[Anchor(Complex_Function_Example)]]
=== Complex Function Example ===

[Pig Wiki] Update of "UDFManual" by OlgaN

Reply via email to