The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

------------------------------------------------------------------------------
  An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions `algebraic`. `COUNT` is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
- It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup][here.)
+ It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup here].)
  {{{#!java
  public class COUNT extends EvalFunc<Long> implements Algebraic{
@@ -231, +231 @@
  || bag || !DataBag ||
  || map || Map<Object, Object> ||
- All Pig-specific classes are available http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/][here.
+ All Pig-specific classes are available [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/ here]
  `Tuple` and `DataBag` are different in that they are not concrete classes but rather interfaces. This enables users to extend Pig with their own versions of tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; they need to go through factory classes: `TupleFactory` and `BagFactory`.
@@ -607, +607 @@
  [[Anchor(Load_Functions)]]
  === Load Functions ===
- Every load function needs to implement the `LoadFunc` interface. An abbreviated version is shown below. The full definition can be seen http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup][here.
+ Every load function needs to implement the `LoadFunc` interface. An abbreviated version is shown below. The full definition can be seen [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup here].
  {{{#!java
  public interface LoadFunc {
@@ -641, +641 @@
  In this query, only `age` needs to be converted to its actual type (=int=) right away. `name` only needs to be converted in the next step of processing where the data is likely to be much smaller. `gpa` is not used at all and will never need to be converted.
- This is the main reason for Pig to separate the reading of the data (which can happen immediately) from the converting of the data (to the right type, which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` that your loader class can extend and will take care of all the conversion routines. The code for it can be found http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup.][here.
+ This is the main reason for Pig to separate the reading of the data (which can happen immediately) from the converting of the data (to the right type, which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` that your loader class can extend and will take care of all the conversion routines. The code for it can be found [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup here]. Note that conversion routines should return null values for data that can't be converted to the specified type.
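
The note about returning null for unconvertible data can be illustrated with a minimal standalone sketch. It does not use any Pig classes: the class name `ConversionSketch` and the method `bytesToInteger` are hypothetical names chosen for illustration, and the real `Utf8StorageConverter` may differ in signature and behavior.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; this is NOT Pig's Utf8StorageConverter.
class ConversionSketch {

    // Interpret raw bytes as UTF-8 text and parse them as an int.
    // Returns null (rather than throwing) when the data cannot be converted.
    static Integer bytesToInteger(byte[] raw) {
        if (raw == null) {
            return null;
        }
        String text = new String(raw, StandardCharsets.UTF_8);
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            // Unconvertible data becomes a null value instead of an error.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesToInteger("42".getBytes(StandardCharsets.UTF_8)));
        System.out.println(bytesToInteger("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Returning null here lets one malformed field surface as a missing value while the rest of the record is still processed.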
