[Pig Wiki] Update of "UDFManual" by OlgaN

Apache Wiki Tue, 13 Jan 2009 16:49:06 -0800

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

------------------------------------------------------------------------------
  = User Defined Function Guide =
  
  Pig provides extensive support for user defined functions (UDFs) as a way to 
specify custom processing. Functions can be a part of almost every operator in 
Pig. This document describes how to use existing functions as well as how to 
write your own functions. 
- 
- '''Note''': The information presented here is for the latest version of Pig, 
currently available on the `types` branch.
  
  [[Anchor(Eval_Functions)]]
  == Eval Functions ==
@@ -72, +70 @@

  
  The actual function implementation is on lines 13-14 and is self-explanatory.
  
- Now that we have the function implemented, it needs to be compiled and 
included in a jar. You will need a `pig.jar` built from the `types` branch to 
compile your UDF. You can use the following set of commands to check out the 
code from the SVN repository and create `pig.jar`:
+ Now that we have the function implemented, it needs to be compiled and 
included in a jar. You will need to build `pig.jar` to compile your UDF. You 
can use the following set of commands to check out the code from the SVN 
repository and create `pig.jar`:
  
  {{{
- svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types
+ svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk
- cd types
+ cd trunk
  ant
  }}}
  
@@ -107, +105 @@

  
  An aggregate function is an eval function that takes a bag and returns a 
scalar value. One interesting and useful property of many aggregate functions 
is that they can be computed incrementally in a distributed fashion. We call 
these functions `algebraic`. `COUNT` is an example of an algebraic function 
because we can count the number of elements in a subset of the data and then 
sum the counts to produce a final output. In the Hadoop world, this means that 
the partial computations can be done by the map and combiner, and the final 
result can be computed by the reducer.
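
  The count-then-sum idea above can be sketched in plain Java. This is only an 
illustration of the algebraic property, not Pig's `COUNT` implementation: the 
class and method names (`AlgebraicCountSketch`, `partialCount`, `combine`, 
`distributedCount`) are hypothetical, and ordinary lists stand in for the 
subsets of data that the map and combiner would see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names come from the Pig API.
final class AlgebraicCountSketch {

    // Partial step (what a map or combiner could do): count one subset.
    static long partialCount(List<?> partition) {
        return partition.size();
    }

    // Final step (what a reducer could do): sum the partial counts.
    static long combine(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    // Counts every subset independently, then merges the partial results.
    static long distributedCount(List<? extends List<?>> partitions) {
        List<Long> partials = new ArrayList<>();
        for (List<?> p : partitions) {
            partials.add(partialCount(p));
        }
        return combine(partials);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d"),
            Arrays.asList("e", "f"));
        System.out.println(distributedCount(partitions)); // prints 6
    }
}
```

  The result is the same as counting all six elements in one pass, which is 
what makes it safe to push the partial work to the map and combiner.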
  
- It is very important for performance to make sure that aggregate functions 
that are algebraic are implemented as such. Let's look at the implementation of 
the COUNT function to see what this means. (Error handling and some other code 
is omitted to save space. The full code can be accessed 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup
 here].)
+ It is very important for performance to make sure that aggregate functions 
that are algebraic are implemented as such. Let's look at the implementation of 
the COUNT function to see what this means. (Error handling and some other code 
is omitted to save space. The full code can be accessed 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup
 here].)
  
  {{{#!java
  public class COUNT extends EvalFunc<Long> implements Algebraic{
@@ -231, +229 @@

  || bag || !DataBag ||
  || map || Map<Object, Object> ||
  
- All Pig-specific classes are available 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/
 here]
+ All Pig-specific classes are available 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/ here]
  
  `Tuple` and `DataBag` are different in that they are not concrete classes but 
rather interfaces. This enables users to extend Pig with their own versions of 
tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; 
they need to go through factory classes: `TupleFactory` and `BagFactory`.
  
@@ -607, +605 @@

  
  [[Anchor(Load_Functions)]]
  === Load Functions ===
- Every load function needs to implement the `LoadFunc` interface. An 
abbreviated version is shown below. The full definition can be seen 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup
 here].
+ Every load function needs to implement the `LoadFunc` interface. An 
abbreviated version is shown below. The full definition can be seen 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup
 here].
  
  {{{#!java
  public interface LoadFunc {
@@ -641, +639 @@

  
  In this query, only `age` needs to be converted to its actual type (`int`) 
right away. `name` only needs to be converted in the next step of processing 
where the data is likely to be much smaller. `gpa` is not used at all and will 
never need to be converted.
  
- This is the main reason for Pig to separate the reading of the data (which 
can happen immediately) from the converting of the data (to the right type, 
which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` 
that your loader class can extend and will take care of all the conversion 
routines. The code for it can be found 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup
 here].
+ This is the main reason for Pig to separate the reading of the data (which 
can happen immediately) from the converting of the data (to the right type, 
which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` 
that your loader class can extend and will take care of all the conversion 
routines. The code for it can be found 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup
 here].
  
  Note that conversion routines should return null values for data that can't be 
converted to the specified type.
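
  A minimal sketch of that null-on-failure convention, in plain Java. The 
helper `bytesToInteger` and its class are hypothetical and are not part of 
`Utf8StorageConverter`; the sketch assumes the raw field arrives as UTF-8 
bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not Pig's Utf8StorageConverter.
final class NullOnBadConversionSketch {

    // Converts raw UTF-8 bytes to an Integer, or returns null when the
    // bytes are missing or do not hold a valid integer.
    static Integer bytesToInteger(byte[] raw) {
        if (raw == null) {
            return null;
        }
        String text = new String(raw, StandardCharsets.UTF_8).trim();
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            // Bad data becomes a null value instead of failing the job.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesToInteger("25".getBytes(StandardCharsets.UTF_8)));  // prints 25
        System.out.println(bytesToInteger("abc".getBytes(StandardCharsets.UTF_8))); // prints null
    }
}
```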
  
@@ -675, +673 @@

  
  Note that this approach assumes that the data has a uniform schema. The 
function needs to make sure that the data it produces conforms to the schema 
returned by `determineSchema`, otherwise the processing will fail. This means 
producing the right number of fields in the tuple (dropping fields or emitting 
null values if needed) and producing fields of the right type (again emitting 
null values as needed).
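
  The conformance rule above can be sketched in plain Java as follows. This 
is an illustration under stated assumptions, not Pig code: `conform` is a 
hypothetical helper, a `List<Object>` stands in for a tuple, and a list of 
`Class` objects stands in for the schema returned by `determineSchema`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; a List stands in for a Pig tuple.
final class SchemaConformSketch {

    // Returns a row with exactly expectedTypes.size() fields: extra fields
    // are dropped, missing fields become null, and fields whose value is
    // not of the expected type become null.
    static List<Object> conform(List<?> fields, List<Class<?>> expectedTypes) {
        List<Object> out = new ArrayList<>(expectedTypes.size());
        for (int i = 0; i < expectedTypes.size(); i++) {
            Object value = i < fields.size() ? fields.get(i) : null;
            out.add(expectedTypes.get(i).isInstance(value) ? value : null);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Class<?>> schema =
            Arrays.<Class<?>>asList(String.class, Integer.class, Double.class);
        // Wrong type in field 2 and one extra field.
        System.out.println(conform(Arrays.asList("joe", "x", 3.5, "extra"), schema));
        // prints [joe, null, 3.5]
    }
}
```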
  
- For complete examples, see 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup
 BinStorage] and 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup
 PigStorage].
+ For complete examples, see 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/BinStorage.java?view=markup
 BinStorage] and 
[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?view=markup
 PigStorage].
  
  [[Anchor(Store_Functions)]]
  === Store Functions ===
