Author: olga
Date: Sun Dec 6 23:56:50 2009
New Revision: 887804
URL: http://svn.apache.org/viewvc?rev=887804&view=rev
Log:
PIG-1129: Pig UDF doc: fieldsToRead function (chandec via olgan)
Modified:
hadoop/pig/branches/branch-0.6/CHANGES.txt
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml
Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=887804&r1=887803&r2=887804&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.6/CHANGES.txt Sun Dec 6 23:56:50 2009
@@ -24,6 +24,8 @@
IMPROVEMENTS
+PIG-1129: Pig UDF doc: fieldsToRead function (chandec via olgan)
+
PIG-1084: Pig 0.6.0 Documentation improvements (chandec via olgan)
PIG-978: MQ docs update (chandec via olgan)
Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml?rev=887804&r1=887803&r2=887804&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml Sun Dec 6 23:56:50 2009
@@ -752,11 +752,11 @@
public Integer bytesToInteger(byte[] b) throws IOException;
public Long bytesToLong(byte[] b) throws IOException;
......
- public void fieldsToRead(Schema schema);
+ public RequiredFieldResponse fieldsToRead(RequiredFieldList requiredFieldList) throws FrontendException;
public Schema determineSchema(String fileName, ExecType execType, DataStorage storage) throws IOException;
</source>
-
+<p><strong>bindTo</strong></p>
<p>The <code>bindTo</code> function is called once by each Pig task before it
starts processing data. It is intended to connect the function to its input. It
provides the following information: </p>
<ul>
<li><p> <code>fileName</code> - The name of the file from which the data is read. It is not used most of the time. </p>
@@ -770,7 +770,11 @@
</ul>
<p>In the Hadoop world, the input data is treated as a continuous stream of bytes. A <code>slicer</code>, discussed in the Advanced Topics section, is used to split the data into chunks, with each chunk going to a particular task for processing. This chunk is what <code>bindTo</code> provides to the UDF. Note that unless you use a custom slicer, the default slicer is not aware of tuple boundaries. This means that the chunk you get can start and end in the middle of a particular tuple. One common approach is to skip the first partial tuple and to continue reading past the end position until the last tuple is complete. This is what <code>PigStorage</code> does, as the example later in this section shows. </p>
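+<p>To make the boundary handling concrete, here is a hedged sketch of this
+approach; it is not the actual <code>PigStorage</code> code. It assumes the
+<code>bindTo</code> parameters described above, and
+<code>skipUntilNewline</code> is a hypothetical helper that advances the
+stream past the next record delimiter: </p>
+<source>
+// Illustrative sketch only; consult LoadFunc in your Pig version
+// for the exact bindTo signature.
+public void bindTo(String fileName, BufferedPositionedInputStream is,
+                   long offset, long end) throws IOException {
+    this.in = is;
+    this.end = end;
+    // If the chunk starts in the middle of a tuple, skip forward to the
+    // next record delimiter; the task that owns the previous chunk will
+    // finish that tuple by reading past its own end position.
+    if (offset != 0) {
+        skipUntilNewline(); // hypothetical helper
+    }
+}
+</source>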
+
+<p><strong>getNext</strong></p>
<p>The <code>getNext</code> function reads the input stream and constructs the
next tuple. It returns <code>null</code> when it is done with processing and
throws an <code>IOException</code> if it fails to process an input tuple. </p>
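+<p>As an illustration, a line-oriented <code>getNext</code> might look like
+the sketch below. The <code>readLine</code> helper and the tab-delimited
+format are assumptions made for the example; note that the fields are kept as
+bytearrays, for the reasons discussed next: </p>
+<source>
+// Sketch of a line-oriented getNext; readLine() is a hypothetical
+// helper, not part of the LoadFunc interface.
+public Tuple getNext() throws IOException {
+    // Stop once we have read past the end of our chunk and the
+    // current tuple is complete.
+    if (in == null || in.getPosition() > end) {
+        return null;
+    }
+    String line = readLine();
+    if (line == null) {
+        return null; // end of stream
+    }
+    // Keep the fields as bytearrays; conversion to real types happens
+    // later, and only if the script actually needs it.
+    Tuple t = TupleFactory.getInstance().newTuple();
+    for (String f : line.split("\t")) {
+        t.append(new DataByteArray(f.getBytes()));
+    }
+    return t;
+}
+</source>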
+
+<p><strong>conversion routines</strong></p>
<p>Next is a set of conversion routines that convert data from
<code>bytearray</code> to the requested type. This requires further
explanation. By default, we would like the loader to do as little per-tuple
processing as possible, because many tuples can be thrown out during
filtering or joins. Also, many fields might not get used because they get
projected out. If the data needs to be converted into another form, we would
like this conversion to happen as late as possible. The majority of loaders
should return the data as bytearrays, and Pig will request a conversion from
bytearray to the actual type when needed. Let's look at the example below: </p>
<source>
@@ -781,11 +785,32 @@
</source>
<p>In this query, only <code>age</code> needs to be converted to its actual
type (<code>int</code>) right away. <code>name</code> only needs to be converted in the
next step of processing where the data is likely to be much smaller.
<code>gpa</code> is not used at all and will never need to be converted. </p>
+
<p>This is the main reason why Pig separates reading the data (which
can happen immediately) from converting the data to the right type
(which can happen later). For ASCII data, Pig provides
<code>Utf8StorageConverter</code>, which your loader class can extend to
take care of all the conversion routines. The code for it can be found <a
href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup">
here</a>. </p>
+
<p>Note that conversion routines should return null values for data that can't
be converted to the specified type. </p>
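+<p>For example, a conversion routine for integers might look like the sketch
+below. This is illustrative rather than the actual
+<code>Utf8StorageConverter</code> code; the key point is returning null,
+rather than throwing, when the bytes do not parse: </p>
+<source>
+// Sketch of a bytearray-to-int conversion routine; unconvertible
+// data maps to null, as described above.
+public Integer bytesToInteger(byte[] b) throws IOException {
+    if (b == null || b.length == 0) {
+        return null;
+    }
+    try {
+        return Integer.valueOf(new String(b));
+    } catch (NumberFormatException e) {
+        return null;
+    }
+}
+</source>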
+
<p>Loaders that work with binary data like <code>BinStorage</code> are not
going to use this model. Instead, they will produce objects of the appropriate
types. However, they might still need to define conversion routines in case
some of the fields in a tuple are of type <code>bytearray</code>. </p>
-<p><code>fieldsToRead</code> is reserved for future use and should be left empty. </p>
+
+<p><strong>fieldsToRead</strong></p>
+<p>
+The intent of the <code>fieldsToRead</code> function is to reduce the amount of data returned from the loader. Pig will evaluate the script and determine the minimal set of columns needed to execute it. This information will be passed to the <code>fieldsToRead</code> function of the loader in the <code>requiredFieldList</code> parameter. The parameter is of type <code>RequiredFieldList</code>, which is defined as part of the
+<a href="http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> interface.
+If the loader chooses not to purge unneeded columns, it can use the following implementation:
+</p>
+<source>
+public LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList) throws FrontendException {
+    return new LoadFunc.RequiredFieldResponse(false);
+}
+</source>
+
+<p>
+This tells Pig that it should expect the entire column set from the loader. We expect that most loaders will stick to this implementation. In our tests of PigStorage, we saw about a 5% improvement when selecting 5 columns out of 40. The loaders that should take advantage of this functionality are the ones, like Zebra, that can pass this information directly to the storage layer. For an example of <code>fieldsToRead</code>, see the implementation in <a href="http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/org/apache/pig/builtin/PigStorage.java?view=markup">PigStorage</a>.
+</p>
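+<p>By contrast, a loader that can push the projection down to its storage
+layer would record the requested columns and answer with
+<code>new RequiredFieldResponse(true)</code>. The sketch below illustrates
+the shape of such an implementation; the <code>getFields()</code> and
+<code>getIndex()</code> accessors and the <code>requiredColumns</code> field
+are assumptions, so check <code>RequiredFieldList</code> in your Pig
+version: </p>
+<source>
+// Illustrative sketch of a loader that honors column pruning.
+public LoadFunc.RequiredFieldResponse fieldsToRead(
+        LoadFunc.RequiredFieldList requiredFieldList) throws FrontendException {
+    if (requiredFieldList == null || requiredFieldList.getFields() == null) {
+        // No pruning requested; return the full column set.
+        return new LoadFunc.RequiredFieldResponse(false);
+    }
+    // Remember which column indexes getNext should emit
+    // (requiredColumns is a field of this hypothetical loader).
+    requiredColumns.clear();
+    for (LoadFunc.RequiredField f : requiredFieldList.getFields()) {
+        requiredColumns.add(f.getIndex());
+    }
+    // true tells Pig that only the requested fields will be returned.
+    return new LoadFunc.RequiredFieldResponse(true);
+}
+</source>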
+
+<p><strong>determineSchema</strong></p>
<p>The <code>determineSchema</code> function must be implemented by loaders
that return real data types rather than <code>bytearray</code> fields. Other
loaders should just return <code>null</code>. The idea here is that Pig needs
to know the actual types it will be getting; Pig will call
<code>determineSchema</code> on the client side to get this information. The
function is provided as a way to sample the data to determine its schema. </p>
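+<p>For a loader that returns only bytearray fields, this reduces to the null
+return described above; a minimal sketch: </p>
+<source>
+// A bytearray-only loader has no real types to report.
+public Schema determineSchema(String fileName, ExecType execType,
+                              DataStorage storage) throws IOException {
+    return null;
+}
+</source>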
+
<p>Here is an example of the function implemented by <code>BinStorage</code>: </p>
<source>