[Pig Wiki] Trivial Update of "UDFManual" by OlgaN

Apache Wiki Thu, 04 Dec 2008 16:17:43 -0800

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

------------------------------------------------------------------------------
  }
  }}}
  
- Lines 12-13 create tuple and bag factories respectively. (Factory is a class 
that creates objects of a particular type. For more details, see the definition 
of http://en.wikipedia.org/wiki/Factory_pattern][a factory pattern. The factory 
class itself is implemented as a singleton to guarantee that the same factory 
is used everywhere. For more details see the definition of 
http://en.wikipedia.org/wiki/Singleton_pattern][a singleton pattern.)
+ Lines 12-13 create tuple and bag factories respectively. (A factory is a class 
that creates objects of a particular type. For more details, see the definition 
of [http://en.wikipedia.org/wiki/Factory_pattern a factory pattern]. The 
factory class itself is implemented as a singleton to guarantee that the same 
factory is used everywhere. For more details, see the definition of 
[http://en.wikipedia.org/wiki/Singleton_pattern a singleton pattern].)
  
  Line 17 creates a bag using the factory that will contain the output of the 
function. Line 21 creates a tuple for each token and adds it to the output bag.
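
  As a rough illustration of the factory and singleton ideas described above, 
here is a minimal self-contained sketch. It does not use Pig's classes: 
`SimpleTupleFactory`, `TokenizeSketch`, and the use of plain lists for tuples 
and bags are hypothetical stand-ins chosen only so the example compiles on its 
own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tuple factory; this is NOT Pig's own class.
final class SimpleTupleFactory {
    // Singleton: one shared instance, so every caller uses the same factory.
    private static final SimpleTupleFactory INSTANCE = new SimpleTupleFactory();

    private SimpleTupleFactory() {}

    static SimpleTupleFactory getInstance() {
        return INSTANCE;
    }

    // Factory method: builds a one-field "tuple" (modelled here as a list).
    List<Object> newTuple(Object field) {
        List<Object> tuple = new ArrayList<>();
        tuple.add(field);
        return tuple;
    }
}

// Hypothetical tokenizer that mirrors the steps described in the text.
final class TokenizeSketch {
    static List<List<Object>> tokenize(String line) {
        SimpleTupleFactory factory = SimpleTupleFactory.getInstance();
        // The "bag" that will hold the output of the function.
        List<List<Object>> bag = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // One tuple per token, added to the output bag.
                bag.add(factory.newTuple(token));
            }
        }
        return bag;
    }
}
```

  Calling `TokenizeSketch.tokenize("a b c")` yields a bag of three single-field 
tuples, and every call goes through the same factory instance.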
  
  [[Anchor(Schema)]]
  === Schema ===
  
- The latest version of Pig uses type information for validation and 
performance. It is important for UDFs to participate in type propagation. Until 
now, our UDFs made no effort to communicate their output schema to Pig. This is 
because, most of the time, Pig can figure out this information by using Java's 
http://java.sun.com/developer/technicalArticles/ALT/Reflection/][Reflection. If 
your UDF returns a scalar or a map, no work is required. However, if your UDF 
returns a `tuple` or a `bag` (of tuples), it needs to help Pig figure out the 
structure of the tuple.
+ The latest version of Pig uses type information for validation and 
performance. It is important for UDFs to participate in type propagation. Until 
now, our UDFs made no effort to communicate their output schema to Pig. This is 
because, most of the time, Pig can figure out this information by using Java's 
[http://java.sun.com/developer/technicalArticles/ALT/Reflection/ Reflection]. 
If your UDF returns a scalar or a map, no work is required. However, if your 
UDF returns a `tuple` or a `bag` (of tuples), it needs to help Pig figure out 
the structure of the tuple.
  
  If a UDF returns a `tuple` or a `bag` and schema information is not provided, 
Pig assumes that the tuple contains a single field of type `bytearray`. If this 
is not the case, then not specifying the schema can cause failures. We look at 
this next.
  
@@ -419, +419 @@

  
  There are several types of errors that can occur in a UDF:
  
-    1.. An error that affects a particular row but is not likely to impact 
other rows. An example of such an error would be a malformed input value or 
divide by zero problem. A reasonable handling of this situation would be to 
emit a warning and return a null value. `ABS` function in the next section 
demonstrates this approach. The current approach is to write the warning to 
`stderr`. Eventually we would like to pass a logger to the UDFs. Note that 
returning a NULL value only makes sense if the malformed value is of type 
`bytearray`. Otherwise the proper type has been already created and should have 
an appropriate value. If this is not the case, it is an internal error and 
should cause the system to fail. Both cases can be seen in the implementation 
of the `ABS` function in the next section.
+  1. An error that affects a particular row but is not likely to impact other 
rows. An example of such an error would be a malformed input value or a 
divide-by-zero problem. A reasonable handling of this situation would be to emit 
a warning and return a null value. The `ABS` function in the next section 
demonstrates this approach. The current approach is to write the warning to 
`stderr`. Eventually we would like to pass a logger to the UDFs. Note that 
returning a null value only makes sense if the malformed value is of type 
`bytearray`. Otherwise the proper type has already been created and should have 
an appropriate value. If this is not the case, it is an internal error and 
should cause the system to fail. Both cases can be seen in the implementation 
of the `ABS` function in the next section.
-    2.. An error that affects the entire processing but can succeed on retry. 
An example of such a failure is the inability to open a lookup file because the 
file could not be found. This could be a temporary environmental issue that can 
go away on retry. A UDF can signal this to Pig by throwing an `IOException` as 
with the case of the `ABS` function below.
+  2. An error that affects the entire processing but can succeed on retry. An 
example of such a failure is the inability to open a lookup file because the 
file could not be found. This could be a temporary environmental issue that can 
go away on retry. A UDF can signal this to Pig by throwing an `IOException` as 
with the case of the `ABS` function below.
-    3.. An error that affects the entire processing and is not likely to 
succeed on retry. An example of such a failure is the inability to open a 
lookup file because of file permission problems. Pig currently does not have a 
way to handle this case. Hadoop does not have a way to handle this case either. 
It will be handled the same way as 2 above.
+  3. An error that affects the entire processing and is not likely to succeed 
on retry. An example of such a failure is the inability to open a lookup file 
because of file permission problems. Pig currently does not have a way to 
handle this case. Hadoop does not have a way to handle this case either. Such 
errors are handled the same way as case 2 above.
  
  Pig provides a helper class `WrappedIOException`. The intent here is to allow 
you to convert any exception into `IOException`. Its usage can be seen in the 
`UPPER` function in our first example.
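
  The error categories above can be sketched in plain Java as follows. This is 
only an illustration under stated assumptions: `ErrorHandlingSketch`, 
`safeDivide`, and `loadLookup` are hypothetical names, the code does not 
extend any Pig class, and the standard `IOException(String, Throwable)` 
constructor stands in for the `WrappedIOException` helper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical stand-in for a UDF; it does not use Pig's API.
final class ErrorHandlingSketch {

    // Case 1: a row-level problem. Warn on stderr and return null so that
    // the remaining rows are still processed.
    static Double safeDivide(String numerator, String denominator) {
        if (numerator == null || denominator == null) {
            return null; // null in, null out
        }
        try {
            double n = Double.parseDouble(numerator);
            double d = Double.parseDouble(denominator);
            if (d == 0.0) {
                System.err.println("WARN: division by zero, returning null");
                return null;
            }
            return n / d;
        } catch (NumberFormatException e) {
            System.err.println("WARN: malformed input, returning null: " + e.getMessage());
            return null;
        }
    }

    // Cases 2 and 3: a failure that affects the whole task. It is reported
    // as an IOException; a missing file already surfaces as one, and
    // unchecked exceptions are converted with the original kept as the cause.
    static List<String> loadLookup(Path lookupFile) throws IOException {
        try {
            return Files.readAllLines(lookupFile);
        } catch (RuntimeException e) {
            throw new IOException("cannot read lookup file " + lookupFile, e);
        }
    }
}
```

  The sketch makes no attempt to distinguish a retryable failure from a 
permanent one, which matches the statement above that both are currently 
handled the same way.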
  
@@ -675, +675 @@

  
  Note that this approach assumes that the data has a uniform schema. The 
function needs to make sure that the data it produces conforms to the schema 
returned by `determineSchema`, otherwise the processing will fail. This means 
producing the right number of fields in the tuple (dropping fields or emitting 
null values if needed) and producing fields of the right type (again emitting 
null values as needed).
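
  A minimal sketch of the conformance step described above, written without 
any Pig classes. `SchemaConformSketch` and `conform` are hypothetical names, 
and modelling the schema as a list of Java classes is an assumption made only 
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the schema is modelled as one Java class per field.
final class SchemaConformSketch {

    // Returns a row with exactly schema.size() fields. Extra fields are
    // dropped, missing fields become null, and a value whose type does not
    // match the schema is replaced by null.
    static List<Object> conform(List<Object> row, List<Class<?>> schema) {
        List<Object> out = new ArrayList<>(schema.size());
        for (int i = 0; i < schema.size(); i++) {
            Object value = i < row.size() ? row.get(i) : null;
            // isInstance(null) is false, so missing values stay null.
            out.add(schema.get(i).isInstance(value) ? value : null);
        }
        return out;
    }
}
```

  With a schema of (`String`, `Integer`), the row (`"a"`, `"oops"`, `3.5`) 
would come out as (`"a"`, null): the second field has the wrong type and the 
third field is not part of the schema.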
  
- For complete examples, see 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup][BinStroage
 and 
http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup][PigStorage.
+ For complete examples, see 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup
 BinStorage] and 
[http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup
 PigStorage].
  
  [[Anchor(Store_Functions)]]
  === Store Functions ===
@@ -718, +718 @@

  
  Comparison UDFs are mostly obsolete now. They were added to the language 
because, at that time, the `ORDER` operator had two significant shortcomings. 
First, it did not allow descending order and, second, it only supported 
alphanumeric order.
  
- The latest version of Pig solves both of these issues. The 
http://wiki.apache.org/pig/UserDefinedOrdering][pointer to the original 
documentation is provided here for completeness.
+ The latest version of Pig solves both of these issues. The 
[http://wiki.apache.org/pig/UserDefinedOrdering pointer] to the original 
documentation is provided here for completeness.
  
  [[Anchor(Builtin_Functions_and_Function_Repositories)]]
  == Builtin Functions and Function Repositories ==
