Author: olga
Date: Thu Apr 8 19:43:04 2010
New Revision: 932076
URL: http://svn.apache.org/viewvc?rev=932076&view=rev
Log:
PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan)
Modified:
hadoop/pig/branches/branch-0.7/CHANGES.txt
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/CHANGES.txt?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.7/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.7/CHANGES.txt Thu Apr 8 19:43:04 2010
@@ -68,6 +68,8 @@ manner (rding via pradeepkth)
IMPROVEMENTS
+PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan)
+
PIG-1316: TextLoader should use Bzip2TextInputFormat for bzip files so that
bzip files can be efficiently processed by splitting the files (pradeepkth)
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
Thu Apr 8 19:43:04 2010
@@ -272,7 +272,7 @@ STORE D INTO 'mysortedcount' U
<section>
<title>Use the LIMIT Operator</title>
-<p>A lot of the times, you are not interested in the entire output but either
a sample or top results. In those cases, using LIMIT can yeild a much better
performance as we push the limit as high as possible to minimize the amount of
data travelling through the pipeline. </p>
+<p>Often you are not interested in the entire output but rather a sample or
top results. In such cases, using LIMIT can yield much better performance as
we push the limit as high as possible to minimize the amount of data
travelling through the pipeline. </p>
<p>Sample:
</p>
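Illustratively (the aliases and input file below are hypothetical, not part of
this patch), the recommended pattern is to apply LIMIT immediately after the
ORDER so the limit is pushed into the sort:

A = LOAD 'myinput' AS (user, visits);
B = ORDER A BY visits DESC;
C = LIMIT B 500;   -- only the top 500 rows travel through the rest of the pipeline
DUMP C;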
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
Thu Apr 8 19:43:04 2010
@@ -6976,7 +6976,7 @@ DUMP B;
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>EXPLAIN [-out path] [-brief] [-dot] [-param
param_name = param_value] [-param_file file_name] alias; </para>
+ <para>EXPLAIN [-script pigscript] [-out path] [-brief]
[-dot] [-param param_name = param_value] [-param_file file_name]
alias; </para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -6985,14 +6985,23 @@ DUMP B;
<title>Terms</title>
<informaltable frame="all">
<tgroup cols="2"><tbody>
-
<row>
<entry>
- <para>-out path</para>
+ <para>-script</para>
+ </entry>
+ <entry>
+ <para>Use to specify a pig script.</para>
</entry>
+ </row>
+
+ <row>
<entry>
- <para>Will generate logical_plan.[txt||dot],
physical_plan.[text||dot], exec_plan.[text||dot] in the specified directory
(path).</para>
- <para>Default (no path given): Stdout </para>
+ <para>-out</para>
+ </entry>
+ <entry>
+ <para>Use to specify the output path (directory).</para>
+ <para>Will generate a logical_plan[.txt|.dot],
physical_plan[.txt|.dot], exec_plan[.txt|.dot] file in the specified
path.</para>
+ <para>Default (no path specified): Stdout </para>
</entry>
</row>
@@ -7010,9 +7019,10 @@ DUMP B;
<para>-dot</para>
</entry>
<entry>
- <para>Dot mode: outputs a format that can be passed to dot for
graphical display.</para>
- <para>Text mode: multiple output (split) will be broken out in
sections. </para>
- <para>Default: Text </para>
+
+ <para>Text mode (default): multiple output (split) will be
broken out in sections. </para>
+ <para>Dot mode: outputs a format that can be passed to the dot
utility for graphical display;
+ it will generate a directed acyclic graph (DAG) of the plans in
any supported format (.gif, .jpg ...).</para>
</entry>
</row>
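Taken together, a usage sketch of these options (the script name here is
hypothetical; num_user_visits is an alias used elsewhere in this reference):

grunt> explain -script visits.pig -out /tmp/plans -dot num_user_visits;

With -dot, the generated plan files can then be rendered with the Graphviz dot
utility, for example: dot -Tgif /tmp/plans/exec_plan.dot > exec_plan.gif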
@@ -7295,7 +7305,7 @@ ILLUSTRATE num_user_visits;
<para>USING - Keyword.</para>
</listitem>
<listitem>
- <para>serializer - A function that converts data from
tuples to stream format. PigStorage is the default serializer. You can also
write your own UDF.</para>
+ <para>serializer - PigStreaming is the default serializer.
</para>
</listitem>
</itemizedlist>
</entry>
@@ -7318,7 +7328,7 @@ ILLUSTRATE num_user_visits;
<para>USING - Keyword.</para>
</listitem>
<listitem>
- <para>deserializer - A function that converts data from
stream format to tuples. PigStorage is the default deserializer. You can also
write your own UDF.</para>
+ <para>deserializer - PigStreaming is the default
deserializer. </para>
</listitem>
</itemizedlist>
</entry>
@@ -7365,7 +7375,7 @@ ILLUSTRATE num_user_visits;
<para>Use DEFINE to specify a function when:</para>
<itemizedlist>
<listitem>
- <para>The function has a log package name that you don't want to
include in a script, especially if you call the function several times in that
script.</para>
+ <para>The function has a long package name that you don't want to
include in a script, especially if you call the function several times in that
script.</para>
</listitem>
<listitem>
<para>The constructor for the function takes string parameters. If
you need to use different constructor parameters for different calls to the
function you will need to create multiple defines - one for each parameter
set.</para>
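A brief sketch covering both cases (the package name, function, and
constructor arguments are hypothetical):

DEFINE lookupUS com.acme.pig.udfs.DictionaryLookup('en-US');
DEFINE lookupDE com.acme.pig.udfs.DictionaryLookup('de-DE');
A = LOAD 'queries';
B = FOREACH A GENERATE lookupUS($0), lookupDE($0);

Each DEFINE binds one constructor parameter set to a short alias, so the long
package name does not have to be repeated at every call site.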
@@ -7375,8 +7385,46 @@ ILLUSTRATE num_user_visits;
<section
><title>About Input and Output</title>
- <para>Serialization is needed to convert data from tuples to a format that
can be processed by the streaming application. Deserialization is needed to
convert the output from the streaming application back into tuples.</para>
- <para>PigStorage, the default serialization/deserialization function,
converts tuples to tab-delimited lines. Pig's BinarySerializer and
BinaryDeserializer functions treat the entire file as a byte stream (no
formatting or interpretation takes place). You can also write your own
serialization/deserialization functions.</para>
+ <para>Serialization is needed to convert data from tuples to a format that
can be processed by the streaming application. Deserialization is needed to
convert the output from the streaming application back into tuples.
PigStreaming is the default serialization/deserialization function.</para>
+
+<para>Streaming uses the same default format as PigStorage to
serialize/deserialize the data. If you want to explicitly specify a format, you
can do it as shown below (see more examples in the Examples: Input/Output
section). </para>
+
+<programlisting>
+DEFINE CMD 'perl PigStreaming.pl - nameMap' input(stdin using
PigStreaming(',')) output(stdout using PigStreaming(','));
+A = LOAD 'file';
+B = STREAM A THROUGH CMD;
+</programlisting>
+
+<para>If you need an alternative format, you will need to create a custom
serializer/deserializer by implementing the following interfaces.</para>
+
+<programlisting>
+interface PigToStream {
+
+    /**
+     * Given a tuple, produce an array of bytes to be passed to the streaming
+     * executable.
+     */
+    public byte[] serialize(Tuple t) throws IOException;
+}
+
+interface StreamToPig {
+
+    /**
+     * Given a byte array from a streaming executable, produce a tuple.
+     */
+    public Tuple deserialize(byte[] bytes) throws IOException;
+
+    /**
+     * This will be called on the front end during planning and not on the back
+     * end during execution.
+     *
+     * @return the {@link LoadCaster} associated with this object.
+     * @throws IOException if there is an exception during LoadCaster creation.
+     */
+    public LoadCaster getLoadCaster() throws IOException;
+}
+</programlisting>
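As a minimal sketch of a custom serializer/deserializer built on these
interfaces (the class name and the fixed '|' delimiter are hypothetical;
Utf8StorageConverter is Pig's standard LoadCaster implementation):

import java.io.IOException;

import org.apache.pig.LoadCaster;
import org.apache.pig.PigToStream;
import org.apache.pig.StreamToPig;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical pipe-delimited serializer/deserializer.
public class PipeStreaming implements PigToStream, StreamToPig {

    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    public byte[] serialize(Tuple t) throws IOException {
        // Join the tuple's fields with '|' and terminate the record with a newline.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            Object field = t.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        sb.append('\n');
        return sb.toString().getBytes("UTF-8");
    }

    public Tuple deserialize(byte[] bytes) throws IOException {
        // Split one record of streaming output back into fields.
        String line = new String(bytes, "UTF-8").trim();
        String[] fields = line.split("\\|", -1);
        Tuple t = tupleFactory.newTuple(fields.length);
        for (int i = 0; i < fields.length; i++) {
            t.set(i, fields[i]);
        }
        return t;
    }

    public LoadCaster getLoadCaster() throws IOException {
        // Reuse the standard UTF-8 caster rather than writing a new one.
        return new Utf8StorageConverter();
    }
}

Such a class would then be referenced the same way as PigStreaming, e.g.
DEFINE Y 'stream.pl' INPUT(stdin USING PipeStreaming()) OUTPUT(stdout USING
PipeStreaming());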
+
</section>
<section>
@@ -7448,15 +7496,15 @@ OP = stream IP through 'perl /a/b/c/scri
</section>
<section>
- <title>Example: Input/Output</title>
- <para>In this example PigStorage is the default serialization/deserialization
function. The tuples from relation A are converted to tab-delimited lines that
are passed to the script.</para>
+ <title>Examples: Input/Output</title>
+ <para>In this example PigStreaming is the default
serialization/deserialization function. The tuples from relation A are
converted to tab-delimited lines that are passed to the script.</para>
<programlisting>
X = STREAM A THROUGH 'stream.pl';
</programlisting>
- <para>In this example PigStorage is used as the
serialization/deserialization function, but a comma is used as the
delimiter.</para>
+ <para>In this example PigStreaming is used as the
serialization/deserialization function, but a comma is used as the
delimiter.</para>
<programlisting>
-DEFINE Y 'stream.pl' INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING
PigStorage(','));
+DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT (stdout USING
PigStreaming(','));
X = STREAM A THROUGH Y;
</programlisting>
@@ -7470,7 +7518,7 @@ X = STREAM A THROUGH Y;
</section>
<section>
- <title>Example: Ship/Cache</title>
+ <title>Examples: Ship/Cache</title>
<para>In this example ship is used to send the script to the cluster
compute nodes.</para>
<programlisting>
DEFINE Y 'stream.pl' SHIP('/work/stream.pl');
@@ -7487,7 +7535,7 @@ X = STREAM A THROUGH Y;
</section>
<section>
- <title>Example: Logging</title>
+ <title>Examples: Logging</title>
<para>In this example the streaming stderr is stored in the
_logs/<dir> directory of the job's output directory. Because the job can
have multiple streaming applications associated with it, you need to ensure
that different directory names are used to avoid conflicts. Pig stores up to
100 tasks per streaming job.</para>
<programlisting>
DEFINE Y 'stream.pl' stderr('<dir>' limit 100);
@@ -8590,6 +8638,43 @@ DUMP X;
<section>
+ <title>Handling Compression</title>
+
+<para>Support for compression is determined by the load/store function.
PigStorage and TextLoader support gzip and bzip compression for both read
(load) and write (store). BinStorage does not support compression.</para>
+
+<para>To work with gzip compressed files, input/output files need to have a
.gz extension. Gzipped files cannot be split across multiple maps; this means
that the number of maps created is equal to the number of part files in the
input location.</para>
+
+<programlisting>
+A = load 'myinput.gz';
+store A into 'myoutput.gz';
+</programlisting>
+
+<para>To work with bzip compressed files, the input/output files need to have
a .bz or .bz2 extension. Because the compression is block-oriented, bzipped
files can be split across multiple maps.</para>
+
+<programlisting>
+A = load 'myinput.bz';
+store A into 'myoutput.bz';
+</programlisting>
+
+<para>Note: PigStorage and TextLoader correctly read compressed files as long
as they are NOT CONCATENATED FILES generated in this manner: </para>
+ <itemizedlist>
+ <listitem>
+ <para>cat *.gz > text/concat.gz</para>
+ </listitem>
+ <listitem>
+ <para>cat *.bz > text/concat.bz </para>
+ </listitem>
+ <listitem>
+ <para>cat *.bz2 > text/concat.bz2</para>
+ </listitem>
+ </itemizedlist>
+
+<para>If you use concatenated gzip or bzip files with your Pig jobs, you will
NOT see a failure but the results will be INCORRECT.</para>
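A sketch of the safe alternative (directory and file names hypothetical):
leave the compressed part files unconcatenated and point the load at their
directory, so each file is decompressed individually:

A = load 'mydata';     -- directory holding part-00000.gz, part-00001.gz, ...
store A into 'myoutput.gz';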
+<para></para>
+
+</section>
+
+ <section>
<title>BinStorage</title>
<para>Loads and stores data in machine-readable format.</para>
@@ -8618,9 +8703,10 @@ DUMP X;
<section>
<title>Usage</title>
- <para>BinStorage works with data that is represented on disk in
machine-readable format.</para>
- <para>BinStorage does not support compression.</para>
- <para>BinStorage is used internally by Pig to store the temporary data that
is created between multiple map/reduce jobs.</para></section>
+ <para>BinStorage works with data that is represented on disk in
machine-readable format.
+ BinStorage does NOT support <ulink
url="#Handling+Compression">compression</ulink>.</para>
+
+ <para>BinStorage is used internally by Pig to store the temporary data
that is created between multiple map/reduce jobs.</para></section>
<section>
<title>Example</title>
@@ -8665,9 +8751,7 @@ STORE X into 'output' USING BinStorage()
<title>Usage</title>
<para>PigStorage is the default function for the LOAD and STORE operators
and works with both simple and complex data types. </para>
- <para>PigStorage supports structured text files (in human-readable UTF-8
format).</para>
-
- <para>PigStorage also supports gzip (.gz) and bzip(.bz or .bz2) compressed
files. PigStorage correctly reads compressed files as long as they are NOT
CONCATENATED files generated in this manner: cat *.gz > text/concat.gz OR cat
*.bz > text/concat.bz (OR cat *.bz2 > text/concat.bz2). If you use concatenated
gzip or bzip files with your Pig jobs, you will not see a failure but the
results will be INCORRECT.</para>
+ <para>PigStorage supports structured text files (in human-readable UTF-8
format). PigStorage also supports <ulink
url="#Handling+Compression">compression</ulink>.</para>
<para>Load statements - PigStorage expects data to be formatted using
field delimiters, either the tab character ('\t') or other specified
character.</para>
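For example (input file and schema hypothetical), the delimiter is passed as
the constructor argument and defaults to tab when omitted:

A = LOAD 'student_data' USING PigStorage(',') AS (name, age, gpa);
STORE A INTO 'output' USING PigStorage('*');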
@@ -8762,7 +8846,7 @@ STORE X INTO 'output' USING PigDump();
<section>
<title>Usage</title>
- <para>TextLoader works with unstructured data in UTF8 format. Each
resulting tuple contains a single field with one line of input text. </para>
+ <para>TextLoader works with unstructured data in UTF8 format. Each
resulting tuple contains a single field with one line of input text. TextLoader
also supports <ulink url="#Handling+Compression">compression</ulink>.</para>
<para>Currently, TextLoader support for compression is limited.</para>
<para>TextLoader cannot be used to store data.</para>
</section>
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
Thu Apr 8 19:43:04 2010
@@ -762,8 +762,12 @@ has methods to convert byte arrays to sp
<p>The LoadFunc abstract class is the main class to extend for implementing a
loader. The methods which need to be overridden are explained below:</p>
<ul>
- <li>getInputFormat() :This method will be called by Pig to get the
InputFormat used by the loader. The methods in the InputFormat (and underlying
RecordReader) will be called by pig in the same manner (and in the same
context) as by Hadoop in a map-reduce java program. If the InputFormat is a
hadoop packaged one, the implementation should use the new API based one under
org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce. If a custom
loader using a text-based InputFormat or a file based InputFormat would like to
read files in all subdirectories under a given input directory recursively,
then it should use the PigFileInputFormat and PigTextInputFormat classes
provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This
is to work around the current limitation in Hadoop's TextInputFormat and
FileInputFormat which only read one level down from provided input directory.
So for example if the input in the load statement is 'dir1' and there are subdirs
'dir2' and 'dir2/dir3' underneath dir1, using Hadoop's TextInputFormat or
FileInputFormat only files under 'dir1' can be read. Using PigFileInputFormat
or PigTextInputFormat (or by extending them), files in all the directories can
be read. </li>
+ <li>getInputFormat(): This method is called by Pig to get the InputFormat
used by the loader. The methods in the InputFormat (and underlying
RecordReader) are called by Pig in the same manner (and in the same context) as
by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop packaged
one, the implementation should use the new API based one under
org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce.<br></br>
<br></br>
+
+ If a custom loader using a text-based InputFormat or a file-based InputFormat
would like to read files in all subdirectories under a given input directory
recursively, then it should use the PigTextInputFormat and PigFileInputFormat
classes provided in
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig
InputFormat classes work around a current limitation in the Hadoop
TextInputFormat and FileInputFormat classes which only read one level down from
the provided input directory. For example, if the input in the load statement
is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop
TextInputFormat and FileInputFormat classes read the files under 'dir1' only.
Using PigTextInputFormat or PigFileInputFormat (or by extending them), the
files in all the directories can be read. </li>
+
<li>setLocation(): This method is called by Pig to communicate the load
location to the loader. The loader should use this method to communicate the
same information to the underlying InputFormat. This method is called multiple
times by Pig - implementations should bear this in mind and should ensure there
are no inconsistent side effects due to the multiple calls. </li>
+
<li>prepareToRead(): Through this method the RecordReader associated with
the InputFormat provided by the LoadFunc is passed to the LoadFunc. The
RecordReader can then be used by the implementation in getNext() to return a
tuple representing a record of data back to Pig. </li>
<li>getNext(): The meaning of getNext() has not changed and is called by the
Pig runtime to get the next tuple in the data - in this method the
implementation should use the underlying RecordReader and construct the tuple
to return (a skeletal sketch follows this list). </li>
</ul>
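To make the contract concrete, a skeletal loader (the class name is
hypothetical; it produces one single-field tuple per line of text, similar to
TextLoader, and touches only the methods described above):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SkeletonLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A new-API (org.apache.hadoop.mapreduce) InputFormat; use
        // PigTextInputFormat instead if subdirectories must be read recursively.
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Called multiple times by Pig; setting the input path is idempotent.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;    // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            // One single-field tuple per line of input text.
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

In a script the loader is then used like any other load function:
A = LOAD 'dir1' USING SkeletonLoader();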
@@ -1124,13 +1128,6 @@ public class SimpleTextStorer extends St
</section>
<!-- END LOAD/STORE FUNCTIONS -->
-<section>
-<title> Comparison Functions</title>
-
-<p>Comparison UDFs are mostly obsolete now. They were added to the language
because, at that time, the <code>ORDER</code> operator had two significant
shortcomings. First, it did not allow descending order and, second, it only
supported alphanumeric order. </p>
-<p>The latest version of Pig solves both of these issues. The <a
href="http://wiki.apache.org/pig/UserDefinedOrdering"> pointer</a> to the
original documentation is provided here for completeness. </p>
-
-</section>
<section>
<title>Builtin Functions and Function Repositories</title>