[Pig Wiki] Trivial Update of "UDFManual" by OlgaN

Apache Wiki Thu, 04 Dec 2008 15:55:07 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

------------------------------------------------------------------------------
  
  An aggregate function is an eval function that takes a bag and returns a 
scalar value. One interesting and useful property of many aggregate functions 
is that they can be computed incrementally in a distributed fashion. We call 
these functions `algebraic`. `COUNT` is an example of an algebraic function 
because we can count the number of elements in a subset of the data and then 
sum the counts to produce a final output. In the Hadoop world, this means that 
the partial computations can be done by the map and combiner, and the final 
result can be computed by the reducer.
  
- It is very important for performance to make sure that aggregate functions 
that are algebraic are implemented as such. Let's look at the implementation of 
the COUNT function to see what this means. (Error handling and some other code 
is omitted to save space. The full code can be accessed 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup][here.)
+ It is very important for performance to make sure that aggregate functions 
that are algebraic are implemented as such. Let's look at the implementation of 
the COUNT function to see what this means. (Error handling and some other code 
is omitted to save space. The full code can be accessed 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup
 here].)
  
  {{{#!java
  public class COUNT extends EvalFunc<Long> implements Algebraic{
@@ -231, +231 @@

  || bag || !DataBag ||
  || map || Map<Object, Object> ||
  
- All Pig-specific classes are available 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/][here.
 
+ All Pig-specific classes are available 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/
 here]
  
  `Tuple` and `DataBag` are different in that they are not concrete classes but 
rather interfaces. This enables users to extend Pig with their own versions of 
tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; 
they need to go through factory classes: `TupleFactory` and `BagFactory`.
  
@@ -607, +607 @@

  
  [[Anchor(Load_Functions)]]
  === Load Functions ===
- Every load function needs to implement the `LoadFunc` interface. An 
abbreviated version is shown below. The full definition can be seen 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup][here.
+ Every load function needs to implement the `LoadFunc` interface. An 
abbreviated version is shown below. The full definition can be seen 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup
 here].
  
  {{{#!java
  public interface LoadFunc {
@@ -641, +641 @@

  
  In this query, only `age` needs to be converted to its actual type (=int=) 
right away. `name` only needs to be converted in the next step of processing 
where the data is likely to be much smaller. `gpa` is not used at all and will 
never need to be converted.
  
- This is the main reason for Pig to separate the reading of the data (which 
can happen immediately) from the converting of the data (to the right type, 
which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` 
that your loader class can extend and will take care of all the conversion 
routines. The code for it can be found 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup.][here.
+ This is the main reason for Pig to separate the reading of the data (which 
can happen immediately) from the converting of the data (to the right type, 
which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` 
that your loader class can extend and will take care of all the conversion 
routines. The code for it can be found 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup
 here].
  
  Note that conversion routines should return null values for data that can't be 
converted to the specified type.
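
As a rough illustration of the note above, a conversion routine that returns null instead of throwing might look like the following minimal sketch. The class name `ConversionSketch` and the method `bytesToInteger` are hypothetical names chosen for this example; this is not the code of Pig's `Utf8StorageConverter`, only a plain-Java sketch that assumes the raw field arrives as UTF-8 text bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Pig's Utf8StorageConverter.
final class ConversionSketch {
    private ConversionSketch() {}

    /**
     * Converts UTF-8 text bytes to an Integer.
     * Returns null (rather than throwing) when the bytes are missing
     * or do not represent a valid int.
     */
    static Integer bytesToInteger(byte[] raw) {
        if (raw == null) {
            return null; // no data to convert
        }
        String text = new String(raw, StandardCharsets.UTF_8).trim();
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null; // unconvertible data becomes a null value
        }
    }
}
```

With this shape, a malformed field such as `abc` in an `int` column yields a null for that field instead of failing the whole load.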
