Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual
------------------------------------------------------------------------------
  }
  }}}
- Lines 12-13 create tuple and bag factories respectively. (Factory is a class that creates objects of a particular type. For more details, see the definition of http://en.wikipedia.org/wiki/Factory_pattern][a factory pattern. The factory class itself is implemented as a singleton to guarantee that the same factory is used everywhere. For more details see the definition of http://en.wikipedia.org/wiki/Singleton_pattern][a singleton pattern.)
+ Lines 12-13 create tuple and bag factories respectively. (Factory is a class that creates objects of a particular type. For more details, see the definition of [http://en.wikipedia.org/wiki/Factory_pattern a factory pattern]. The factory class itself is implemented as a singleton to guarantee that the same factory is used everywhere. For more details see the definition of [http://en.wikipedia.org/wiki/Singleton_pattern a singleton pattern].)
  Line 17 creates a bag using the factory that will contain the output of the function. Line 21 creates a tuple for each token and adds it to the output bag.
  [[Anchor(Schema)]]
  === Schema ===
- The latest version of Pig uses type information for validation and performance. It is important for UDFs to participate in type propagation. Until now, our UDFs made no effort to communicate their output schema to Pig. This is because, most of the time, Pig can figure out this information by using Java's http://java.sun.com/developer/technicalArticles/ALT/Reflection/][Reflection. If your UDF returns a scalar or a map, no work is required. However, if your UDF returns a `tuple` or a `bag` (of tuples), it needs to help Pig figure out the structure of the tuple.
+ The latest version of Pig uses type information for validation and performance. It is important for UDFs to participate in type propagation.
Until now, our UDFs made no effort to communicate their output schema to Pig. This is because, most of the time, Pig can figure out this information by using Java's [http://java.sun.com/developer/technicalArticles/ALT/Reflection/ Reflection]. If your UDF returns a scalar or a map, no work is required. However, if your UDF returns a `tuple` or a `bag` (of tuples), it needs to help Pig figure out the structure of the tuple.
  If a UDF returns a `tuple` or a `bag` and schema information is not provided, Pig assumes that the tuple contains a single field of type `bytearray`. If this is not the case, then not specifying the schema can cause failures. We look at this next.
@@ -419, +419 @@
  There are several types of errors that can occur in a UDF:
- 1.. An error that affects a particular row but is not likely to impact other rows. An example of such an error would be a malformed input value or divide by zero problem. A reasonable handling of this situation would be to emit a warning and return a null value. `ABS` function in the next section demonstrates this approach. The current approach is to write the warning to `stderr`. Eventually we would like to pass a logger to the UDFs. Note that returning a NULL value only makes sense if the malformed value is of type `bytearray`. Otherwise the proper type has been already created and should have an appropriate value. If this is not the case, it is an internal error and should cause the system to fail. Both cases can be seen in the implementation of the `ABS` function in the next section.
+ 1. An error that affects a particular row but is not likely to impact other rows. An example of such an error would be a malformed input value or divide by zero problem. A reasonable handling of this situation would be to emit a warning and return a null value. `ABS` function in the next section demonstrates this approach. The current approach is to write the warning to `stderr`. Eventually we would like to pass a logger to the UDFs.
Note that returning a NULL value only makes sense if the malformed value is of type `bytearray`. Otherwise the proper type has been already created and should have an appropriate value. If this is not the case, it is an internal error and should cause the system to fail. Both cases can be seen in the implementation of the `ABS` function in the next section.
- 2.. An error that affects the entire processing but can succeed on retry. An example of such a failure is the inability to open a lookup file because the file could not be found. This could be a temporary environmental issue that can go away on retry. A UDF can signal this to Pig by throwing an `IOException` as with the case of the `ABS` function below.
+ 2. An error that affects the entire processing but can succeed on retry. An example of such a failure is the inability to open a lookup file because the file could not be found. This could be a temporary environmental issue that can go away on retry. A UDF can signal this to Pig by throwing an `IOException` as with the case of the `ABS` function below.
- 3.. An error that affects the entire processing and is not likely to succeed on retry. An example of such a failure is the inability to open a lookup file because of file permission problems. Pig currently does not have a way to handle this case. Hadoop does not have a way to handle this case either. It will be handled the same way as 2 above.
+ 3. An error that affects the entire processing and is not likely to succeed on retry. An example of such a failure is the inability to open a lookup file because of file permission problems. Pig currently does not have a way to handle this case. Hadoop does not have a way to handle this case either. It will be handled the same way as 2 above.
  Pig provides a helper class `WrappedIOException`. The intent here is to allow you to convert any exception into `IOException`. Its usage can be seen in the `UPPER` function in our first example.
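The error categories described above can be illustrated with a small, Pig-free sketch. This is not the manual's `ABS` implementation: the class name `UdfErrorHandlingSketch` and both method names are hypothetical, a `String` stands in for Pig's `bytearray`, and the standard `IOException(String, Throwable)` constructor is used in place of `WrappedIOException` so that the example compiles with only the JDK.

```java
import java.io.IOException;

// Hypothetical, Pig-free illustration of the error-handling policy above.
final class UdfErrorHandlingSketch {

    // Absolute value of one input field.
    // Assumption: a String plays the role of an untyped bytearray field.
    static Double abs(Object input) throws IOException {
        if (input == null) {
            return null; // null in, null out
        }
        if (input instanceof Number) {
            // Already typed, so the value is expected to be usable.
            return Math.abs(((Number) input).doubleValue());
        }
        if (input instanceof String) {
            try {
                return Math.abs(Double.parseDouble((String) input));
            } catch (NumberFormatException e) {
                // Case 1: only this row is affected. Warn on stderr, return null.
                System.err.println("WARN: abs: cannot parse '" + input + "'");
                return null;
            }
        }
        // A typed value of an unexpected type is treated as an internal error.
        // Assumption: it is signalled with IOException, like cases 2 and 3.
        throw wrap("abs: unexpected input type " + input.getClass().getName(),
                   new ClassCastException(input.getClass().getName()));
    }

    // Converts any exception into an IOException, keeping the cause.
    // Stand-in for the role the text assigns to WrappedIOException.
    static IOException wrap(String message, Throwable cause) {
        return new IOException(message, cause);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(abs("-3.5")); // well-formed untyped value
        System.out.println(abs("oops")); // malformed value: warning, then null
    }
}
```

The sketch keeps the two outcomes separate: a malformed untyped value costs only that row, while anything that indicates a broader problem is surfaced as an `IOException` for the caller to handle.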
@@ -675, +675 @@
  Note that this approach assumes that the data has a uniform schema. The function needs to make sure that the data it produces conforms to the schema returned by `determineSchema`, otherwise the processing will fail. This means producing the right number of fields in the tuple (dropping fields or emitting null values if needed) and producing fields of the right type (again emitting null values as needed).
- For complete examples, see http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup][BinStroage and http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup][PigStorage.
+ For complete examples, see [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup BinStroage] and [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup PigStorage].
  [[Anchor(Store_Functions)]]
  === Store Functions ===
@@ -718, +718 @@
  Comparison UDFs are mostly obsolete now. They were added to the language because, at that time, the `ORDER` operator had two significant shortcomings. First, it did not allow descending order and, second, it only supported alphanumeric order.
- The latest version of Pig solves both of these issues. The http://wiki.apache.org/pig/UserDefinedOrdering][pointer to the original documentation is provided here for completeness.
+ The latest version of Pig solves both of these issues. The [http://wiki.apache.org/pig/UserDefinedOrdering pointer] to the original documentation is provided here for completeness.
  [[Anchor(Builtin_Functions_and_Function_Repositories)]]
  == Builtin Functions and Function Repositories ==
