Author: olga
Date: Thu Nov 12 18:43:45 2009
New Revision: 835496

URL: http://svn.apache.org/viewvc?rev=835496&view=rev
Log:
PIG-1089: Pig 0.6.0 Documentation  (chandec via olgan)

Modified:
    hadoop/pig/trunk/CHANGES.txt
    
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml

Modified: hadoop/pig/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Thu Nov 12 18:43:45 2009
@@ -26,6 +26,8 @@
 
 IMPROVEMENTS
 
+PIG-1089: Pig 0.6.0 Documentation  (chandec via olgan)
+
 PIG-958: Splitting output data on key field (ankur via pradeepkth)
 
 PIG-1058: FINDBUGS: remaining "Correctness Warnings" (olgan)

Modified: 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
 (original)
+++ 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
 Thu Nov 12 18:43:45 2009
@@ -5412,7 +5412,7 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL n];</para>
+               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING "collected"] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5454,6 +5454,27 @@
                <para>A tuple expression. This is the group key or key field. 
If the result of the tuple expression is a single field, the key will be the 
value of the first field rather than a tuple with one field.</para>
             </entry>
          </row>
+         
+         <row>
+            <entry>
+               <para>USING</para>
+            </entry>
+            <entry>
+               <para>Keyword</para>
+            </entry>
+         </row>
+         <row>
+            <entry>
+               <para>"collected"</para>
+            </entry>
+            <entry>
+               <para>Allows for more efficient computation of a group if the loader guarantees that the data for the 
+               same key is contiguous and is given to a single map. As of this release, only the Zebra loader makes this 
+               guarantee. The efficiency is achieved by performing the group operation in the map phase 
+               rather than the reduce phase (see <ulink url="piglatin_users.html#Integration+with+Zebra">Integration with Zebra</ulink>). 
+               This feature cannot be used with the COGROUP operator.</para>
+            </entry>
+         </row>         
+         
 
          <row>
             <entry>
@@ -5553,6 +5574,10 @@
 (19,{(Mary)})
 (20,{(Bill)})
 </programlisting>
+
+   </section>
+   <section>
+   <title>Example</title>
    
    <para>Suppose we have relation A.</para>
 <programlisting>
@@ -5629,6 +5654,18 @@
 </programlisting>
    
    </section>
+   
+   <section>
+   <title>Example</title>
+<para>This example shows a map-side group.</para>   
+<programlisting>
+ register zebra.jar;
+ A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa', 'sorted');
+ B = GROUP A BY name USING "collected";
+ C = FOREACH B GENERATE group, MAX(A.age), COUNT_STAR(A);
+</programlisting>
+    </section>
+
    </section>
    
    <section>
@@ -5793,7 +5830,8 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [PARALLEL n];</para>
+               <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column 
+               [USING "replicated" | "skewed"] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable>
@@ -5865,6 +5903,34 @@
             </entry>
          </row>
 
+  <row>
+            <entry>
+               <para>USING</para>
+            </entry>
+            <entry>
+               <para>Keyword</para>
+            </entry>
+         </row>
+         <row>
+            <entry>
+               <para>"replicated"</para>
+            </entry>
+            <entry>
+               <para>Use to perform fragment replicate joins (see <ulink url="piglatin_users.html#Fragment+Replicate+Joins">Fragment Replicate Joins</ulink>).</para>
+               <para>For replicated joins, only the left outer join is supported.</para>
+            </entry>
+         </row>
+         
+                  <row>
+            <entry>
+               <para>"skewed"</para>
+            </entry>
+            <entry>
+               <para>Use to perform skewed joins (see <ulink 
url="piglatin_users.html#Skewed+Joins">Skewed Joins</ulink>).</para>
+            </entry>
+         </row>
+
+
          <row>
             <entry>
                <para>PARALLEL n</para>
@@ -5929,6 +5995,21 @@
 B = LOAD 'b.txt' AS (n:chararray, m:chararray);
 C = JOIN A BY $0 FULL, B BY $0;
 </programlisting>
+
+<para>This example shows a replicated left outer join.</para>
+<programlisting>
+A = LOAD 'large';
+B = LOAD 'tiny';
+C = JOIN A BY $0 LEFT, B BY $0 USING "replicated";
+</programlisting>
+
+<para>This example shows a skewed full outer join.</para>
+<programlisting>
+A = LOAD 'studenttab' AS (name, age, gpa);
+B = LOAD 'votertab' AS (name, age, registration, contribution);
+C = JOIN A BY name FULL, B BY name USING "skewed";
+</programlisting>
+
 </section>
 </section>  
   
@@ -8739,12 +8820,78 @@
 </programlisting>
    </section></section></section>
    
+      <!-- Shell COMMANDS-->
+   <section>
+   <title>Shell Commands</title>
+   
+      <section>
+   <title>fs</title>
+   <para>Invokes any FSShell command from within a Pig script or the Grunt 
shell.</para>
+   
+   <section>
+   <title>Syntax </title>
+   <informaltable frame="all">
+      <tgroup cols="1"><tbody><row>
+            <entry>
+               <para>fs subcommand subcommand_parameters </para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Terms</title>
+   <informaltable frame="all">
+      <tgroup cols="2">
+      <tbody>
+      <row>
+            <entry>
+               <para>subcommand</para>
+            </entry>
+            <entry>
+               <para>The FSShell command.</para>
+            </entry>
+         </row>
+               <row>
+            <entry>
+               <para>subcommand_parameters</para>
+            </entry>
+            <entry>
+               <para>The FSShell command parameters.</para>
+            </entry>
+         </row>
+         </tbody>
+         </tgroup>
+   </informaltable>
    
+   </section>
    
+   <section>
+   <title>Usage</title>
+   <para>Use the fs command to invoke any FSShell command from within a Pig script or the Grunt shell. 
+   The fs command greatly extends the set of supported file system commands and the capabilities 
+   of existing commands such as ls, which now supports globbing. For a complete list of 
+   FSShell commands, see the 
+   <ulink url="http://hadoop.apache.org/common/docs/current/hdfs_shell.html">HDFS File System Shell Guide</ulink>.</para>
+   </section>
+   
+   <section>
+   <title>Examples</title>
+   <para>In these examples, a directory is created, a file is copied, and a file is listed.</para>
+<programlisting>
+fs -mkdir /tmp
+fs -copyFromLocal file-x file-y
+fs -ls file-y
+</programlisting>
+   </section>
+    </section>
+        </section>
+    
    <!-- FILE COMMANDS-->
    <section>
    <title>File Commands</title>
-
+   <para>Note: Beginning with Pig 0.6.0, the file commands are deprecated 
+   and will be removed in a future release. 
+   Use Pig's fs command to invoke the 
+   <ulink url="piglatin_reference.html#Shell+Commands">shell commands</ulink> instead. 
+   </para>
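+   <para>For example, where a script or Grunt session previously used the ls file command, 
+   the equivalent shell command is the following (the directory name is illustrative):</para>
+<programlisting>
+fs -ls mydir
+</programlisting>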
    <section>
    <title>cat</title>
    <para>Prints the content of one or more files to the screen.</para>
@@ -8786,7 +8933,8 @@
 john adams
 anne white
 </programlisting>
-   </section></section>
+   </section>
+   </section>
    
    <section>
    <title>cd</title>
@@ -8984,96 +9132,7 @@
 </programlisting>
    </section></section>
    
-   <section>
-   <title>exec</title>
-   <para>Run a Pig script.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>exec [–param param_name = param_value] [–param_file 
file_name] script  </para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-   <tgroup cols="2"><tbody>
-        <row>
-            <entry>
-               <para>–param param_name = param_value</para>
-            </entry>
-            <entry>
-               <para>See Parameter Substitution.</para>
-            </entry>
-        </row>
-
-        <row>
-            <entry>
-               <para>–param_file file_name</para>
-            </entry>
-            <entry>
-               <para>See Parameter Substitution. </para>
-            </entry>
-        </row>
-   
-      <row>
-            <entry>
-               <para>script</para>
-            </entry>
-            <entry>
-               <para>The name of a Pig script.</para>
-            </entry>
-         </row>
-         
-   </tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>Use the exec command to run a Pig script with no interaction between 
the script and the Grunt shell (batch mode). Aliases defined in the script are 
not available to the shell; however, the files produced as the output of the 
script and stored on the system are visible after the script is run. Aliases 
defined via the shell are not available to the script. </para>
-   <para>With the exec command, store statements will not trigger execution; 
rather, the entire script is parsed before execution starts. Unlike the run 
command, exec does not change the command history or remembers the handles used 
inside the script. Exec without any parameters can be used in scripts to force 
execution up to the point in the script where the exec occurs. </para>
-   <para>For comparison, see the run command. Both the exec and run commands 
are useful for debugging because you can modify a Pig script in an editor and 
then rerun the script in the Grunt shell without leaving the shell. Also, both 
commands promote Pig script modularity as they allow you to reuse existing 
components.</para>
-   </section>
-   
-   <section>
-   <title>Examples</title>
-   <para>In this example the script is displayed and run.</para>
-
-<programlisting>
-grunt&gt; cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = LIMIT a 3;
-DUMP b;
-
-grunt&gt; exec myscript.pig
-(alice,20,2.47)
-(luke,18,4.00)
-(holly,24,3.27)
-</programlisting>
-
-   <para>In this example parameter substitution is used with the exec 
command.</para>
-<programlisting>
-grunt&gt; cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = ORDER a BY name;
-
-STORE b into '$out';
-
-grunt&gt; exec –param out=myoutput myscript.pig
-</programlisting>
-
-      <para>In this example multiple parameters are specified.</para>
-<programlisting>
-grunt&gt; exec –param p1=myparam1 –param p2=myparam2 myscript.pig
-</programlisting>
-
-   </section>
-   
-   </section>
+ 
    
    <section>
    <title>ls</title>
@@ -9343,8 +9402,15 @@
 </programlisting>
    </section></section>
    
+
+   </section>
+   
+   
    <section>
-   <title>run</title>
+   <title>Utility Commands</title>
+   
+  <section>
+   <title>exec</title>
    <para>Run a Pig script.</para>
    
    <section>
@@ -9352,7 +9418,7 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>run [–param param_name = param_value] [–param_file 
file_name] script </para>
+               <para>exec [–param param_name = param_value] [–param_file 
file_name] script  </para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -9361,23 +9427,24 @@
    <title>Terms</title>
    <informaltable frame="all">
    <tgroup cols="2"><tbody>
-         <row>
+        <row>
             <entry>
                <para>–param param_name = param_value</para>
             </entry>
             <entry>
                <para>See Parameter Substitution.</para>
             </entry>
-         </row>
+        </row>
 
-         <row>
+        <row>
             <entry>
                <para>–param_file file_name</para>
             </entry>
             <entry>
                <para>See Parameter Substitution. </para>
             </entry>
-         </row>
+        </row>
+   
       <row>
             <entry>
                <para>script</para>
@@ -9392,49 +9459,47 @@
    
    <section>
    <title>Usage</title>
-   <para>Use the run command to run a Pig script that can interact with the 
Grunt shell (interactive mode). The script has access to aliases defined 
externally via the Grunt shell. The Grunt shell has access to aliases defined 
within the script. All commands from the script are visible in the command 
history. </para>   
-       <para>With the run command, every store triggers execution. The 
statements from the script are put into the command history and all the aliases 
defined in the script can be referenced in subsequent statements after the run 
command has completed. Issuing a run command on the grunt command line has 
basically the same effect as typing the statements manually. </para>   
-   <para>For comparison, see the exec command. Both the run and exec commands 
are useful for debugging because you can modify a Pig script in an editor and 
then rerun the script in the Grunt shell without leaving the shell. Also, both 
commands promote Pig script modularity as they allow you to reuse existing 
components.</para>
-  </section>
+   <para>Use the exec command to run a Pig script with no interaction between 
the script and the Grunt shell (batch mode). Aliases defined in the script are 
not available to the shell; however, the files produced as the output of the 
script and stored on the system are visible after the script is run. Aliases 
defined via the shell are not available to the script. </para>
+   <para>With the exec command, store statements will not trigger execution; 
+   rather, the entire script is parsed before execution starts. Unlike the run 
+   command, exec does not change the command history, nor does it remember the 
+   handles used inside the script. Exec without any parameters can be used in scripts 
+   to force execution up to the point in the script where the exec occurs.</para>
+   <para>For comparison, see the run command. Both the exec and run commands 
are useful for debugging because you can modify a Pig script in an editor and 
then rerun the script in the Grunt shell without leaving the shell. Also, both 
commands promote Pig script modularity as they allow you to reuse existing 
components.</para>
+   </section>
    
    <section>
-   <title>Example</title>
-   <para>In this example the script interacts with the results of commands 
issued via the Grunt shell.</para>
+   <title>Examples</title>
+   <para>In this example the script is displayed and run.</para>
+
 <programlisting>
 grunt&gt; cat myscript.pig
-b = ORDER a BY name;
-c = LIMIT b 10;
-
-grunt&gt; a = LOAD 'student' AS (name, age, gpa);
-
-grunt&gt; run myscript.pig
-
-grunt&gt; d = LIMIT c 3;
+a = LOAD 'student' AS (name, age, gpa);
+b = LIMIT a 3;
+DUMP b;
 
-grunt&gt; DUMP d;
+grunt&gt; exec myscript.pig
 (alice,20,2.47)
-(alice,27,1.95)
-(alice,36,2.27)
+(luke,18,4.00)
+(holly,24,3.27)
 </programlisting>
-   
-   
-   <para>In this example parameter substitution is used with the run 
command.</para>
-<programlisting>
-grunt&gt; a = LOAD 'student' AS (name, age, gpa);
 
+   <para>In this example parameter substitution is used with the exec 
command.</para>
+<programlisting>
 grunt&gt; cat myscript.pig
+a = LOAD 'student' AS (name, age, gpa);
 b = ORDER a BY name;
+
 STORE b into '$out';
 
-grunt&gt; run –param out=myoutput myscript.pig
+grunt&gt; exec –param out=myoutput myscript.pig
 </programlisting>
-   
-   </section></section>
+
+      <para>In this example multiple parameters are specified.</para>
+<programlisting>
+grunt&gt; exec –param p1=myparam1 –param p2=myparam2 myscript.pig
+</programlisting>
+
    </section>
    
+   </section>   
    
-   <section>
-   <title>Utility Commands</title>
    
    <section>
    <title>help</title>
@@ -9557,6 +9622,97 @@
 </programlisting>
    </section></section>
    
+   
+   <section>
+   <title>run</title>
+   <para>Run a Pig script.</para>
+   
+   <section>
+   <title>Syntax</title>
+   <informaltable frame="all">
+      <tgroup cols="1"><tbody><row>
+            <entry>
+               <para>run [–param param_name = param_value] [–param_file 
file_name] script </para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Terms</title>
+   <informaltable frame="all">
+   <tgroup cols="2"><tbody>
+         <row>
+            <entry>
+               <para>–param param_name = param_value</para>
+            </entry>
+            <entry>
+               <para>See Parameter Substitution.</para>
+            </entry>
+         </row>
+
+         <row>
+            <entry>
+               <para>–param_file file_name</para>
+            </entry>
+            <entry>
+               <para>See Parameter Substitution. </para>
+            </entry>
+         </row>
+      <row>
+            <entry>
+               <para>script</para>
+            </entry>
+            <entry>
+               <para>The name of a Pig script.</para>
+            </entry>
+         </row>
+         
+   </tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Usage</title>
+   <para>Use the run command to run a Pig script that can interact with the 
Grunt shell (interactive mode). The script has access to aliases defined 
externally via the Grunt shell. The Grunt shell has access to aliases defined 
within the script. All commands from the script are visible in the command 
history. </para>   
+       <para>With the run command, every store triggers execution. The 
statements from the script are put into the command history and all the aliases 
defined in the script can be referenced in subsequent statements after the run 
command has completed. Issuing a run command on the grunt command line has 
basically the same effect as typing the statements manually. </para>   
+   <para>For comparison, see the exec command. Both the run and exec commands 
are useful for debugging because you can modify a Pig script in an editor and 
then rerun the script in the Grunt shell without leaving the shell. Also, both 
commands promote Pig script modularity as they allow you to reuse existing 
components.</para>
+  </section>
+   
+   <section>
+   <title>Example</title>
+   <para>In this example the script interacts with the results of commands 
issued via the Grunt shell.</para>
+<programlisting>
+grunt&gt; cat myscript.pig
+b = ORDER a BY name;
+c = LIMIT b 10;
+
+grunt&gt; a = LOAD 'student' AS (name, age, gpa);
+
+grunt&gt; run myscript.pig
+
+grunt&gt; d = LIMIT c 3;
+
+grunt&gt; DUMP d;
+(alice,20,2.47)
+(alice,27,1.95)
+(alice,36,2.27)
+</programlisting>
+   
+   
+   <para>In this example parameter substitution is used with the run 
command.</para>
+<programlisting>
+grunt&gt; a = LOAD 'student' AS (name, age, gpa);
+
+grunt&gt; cat myscript.pig
+b = ORDER a BY name;
+STORE b into '$out';
+
+grunt&gt; run –param out=myoutput myscript.pig
+</programlisting>
+   
+   </section></section>   
+   
+   
+   
    <section>
    <title>set</title>
    <para>Assigns values to keys used in Pig.</para>

Modified: 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml 
(original)
+++ 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml 
Thu Nov 12 18:43:45 2009
@@ -158,12 +158,12 @@
    <title>Increasing Parallelism</title>
    <p>To increase the parallelism of a job, include the PARALLEL clause with 
the COGROUP, CROSS, DISTINCT, GROUP, JOIN and ORDER operators. 
    PARALLEL controls the number of reducers only; the number of maps is 
determined by the input data 
-   (see the <a href="http://wiki.apache.org/pig/PigUserCookbook";>Pig User 
Cookbook</a>).</p>
+   (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
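+   <p>For example, a minimal sketch of the PARALLEL clause (the reducer count of 10 is illustrative):</p>
+   <source>
+B = GROUP A BY name PARALLEL 10;
+   </source>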
    </section>
    
    <section><title>Increasing Performance</title>
    <p>You can increase or optimize the performance of your Pig Latin scripts 
by following a few simple rules 
-   (see the <a href="http://wiki.apache.org/pig/PigUserCookbook";>Pig User 
Cookbook</a>).</p>
+   (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
    </section>
    
    <section>
@@ -420,8 +420,8 @@
 <title>Specialized Joins</title>
 <p>
Pig Latin includes three "specialized" joins: fragment replicate joins, skewed joins, and merge joins. 
-These joins are performed using the <a 
href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
-Currently, these joins <strong>cannot</strong> be performed using outer joins.
+Replicate, skewed, and merge joins can be performed using the <a href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
+Replicate and skewed joins can also be performed using the <a href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax.
 </p>
 
 <!-- FRAGMENT REPLICATE JOINS-->
@@ -434,7 +434,7 @@
  
 <section>
 <title>Usage</title>
-<p>Perform a fragment replicate join with the USING clause (see the <a 
href="piglatin_reference.html#JOIN">JOIN</a> operator).
+<p>Perform a fragment replicate join with the USING clause (see <a 
href="piglatin_reference.html#JOIN">JOIN</a> and <a 
href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>).
In this example, a large relation is joined with two smaller relations. Note 
that the large relation comes first, followed by the smaller relations; 
all small relations together must fit into main memory, otherwise an 
error is generated.</p>
 <source>
@@ -478,11 +478,11 @@
 
 <section>
 <title>Usage</title>
-<p>Perform a skewed join with the USING clause (see the <a 
href="piglatin_reference.html#JOIN">JOIN</a> operator). </p>
+<p>Perform a skewed join with the USING clause (see <a 
href="piglatin_reference.html#JOIN">JOIN</a> and <a 
href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>). </p>
 <source>
 big = LOAD 'big_data' AS (b1,b2,b3);
 massive = LOAD 'massive_data' AS (m1,m2,m3);
-c = JOIN big BY b1, massive BY m1 USING "skewed";
+C = JOIN big BY b1, massive BY m1 USING "skewed";
 </source>
 </section>
 
@@ -683,9 +683,92 @@
 D = JOIN C BY $1, B BY $1;
 </source>
 </section>
+</section> <!-- END OPTIMIZATION RULES -->
+
+ <!-- MEMORY MANAGEMENT -->
+<section>
+<title>Memory Management</title>
+<p>For Pig 0.6.0 we changed how Pig decides when to spill bags to disk. In the 
+past, Pig tried to determine when an application was getting close to the memory 
+limit and then spill at that time. However, because Java does not provide an 
+accurate way to measure current memory usage, Pig often ran out of memory.</p>
 
+<p>In the current version, we allocate a fixed amount of memory to store bags 
+and spill to disk as soon as the memory limit is reached. This is very similar 
+to how Hadoop decides when to spill data accumulated by the combiner.</p>
 
-</section> <!-- END OPTIMIZATION RULES -->
+<p>The amount of memory allocated to bags is determined by the 
+pig.cachedbag.memusage property; the default is set to 10% of available memory. 
+Note that this memory is shared across all large bags used by the application.</p>
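+<p>For example, to raise the limit to 20% of available memory, the property can be 
+set in the pig.properties file (the value shown is illustrative):</p>
+<source>
+pig.cachedbag.memusage=0.2
+</source>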
+
+</section> <!-- END MEMORY MANAGEMENT  -->
+
+ <!-- ZEBRA INTEGRATION -->
+<section>
+<title>Integration with Zebra</title>
+ <p>This version of Pig is integrated with the Zebra storage format. Zebra is a 
+ recent Pig contrib project; the details can be found at 
+ http://wiki.apache.org/pig/zebra. Pig can now: </p>
+ <ul>
+ <li>Load data in Zebra format</li>
+ <li>Take advantage of sorted Zebra tables for map-side group and merge join</li>
+ <li>Store data in Zebra format</li>
+ </ul>
+ <p></p>
+ <p>To load data in Zebra format using TableLoader, do the following:</p>
+ <source>
+register zebra.jar;
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+B = FOREACH A GENERATE name, age, gpa;
+</source>
+  
+ <p>There are a few things to note:</p>
+ <ol>
+ <li>You need to register the Zebra jar file the same way you would for 
+ any other UDF.</li>
+ <li>You need to place the jar on your classpath.</li>
+ <li>Zebra data is self-describing and always contains a schema. This means that 
+ the AS clause is unnecessary as long as you know what the column names and 
+ types are. To determine the column names and types, you can run the DESCRIBE 
+ statement right after the load:
+ <source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+DESCRIBE A;
+A: {name: chararray,age: int,gpa: float}
+</source>
+ </li>
+ </ol>
+   
+<p>You can provide alternative names for the columns with the AS clause. You can 
+also provide types as long as the original type can be converted to the new type.</p>
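+<p>For example, assuming the studenttab schema shown above, the columns can be 
+renamed and retyped (float converts to double); the new names are illustrative:</p>
+<source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader() 
+    AS (student_name: chararray, student_age: int, student_gpa: double);
+</source>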
+ 
+<p>You can provide multiple, comma-separated files to the loader:</p>
+<source>
+A = LOAD 'studenttab, votertab' USING org.apache.hadoop.zebra.pig.TableLoader();
+</source>
+
+<p>TableLoader supports efficient column selection. The current version of Pig 
+does not support automatically pushing projections down to the loader. 
+(The work is in progress and will be done after beta.) 
+Meanwhile, the loader allows passing columns down via a list of arguments. 
+This example tells the loader to return only two columns, name and age.</p>
+<source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age');
+</source>
+
+<p>If the input data is globally sorted, a map-side group or merge join can be 
+used. Note the 'sorted' argument passed to the loader. This lets the loader 
+know that the data is expected to be globally sorted and that all rows for a 
+given key must be given to the same map.</p>
+
+<p>Here is an example of a merge join. Note that the first argument to the 
+loader is left empty to indicate that all columns are requested.</p>
+<source>
+A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
+B = LOAD 'votersortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
+G = JOIN A BY $0, B BY $0 USING "merge";
+</source>
+
+<p>Here is an example of a map-side group. Note that multiple sorted files are 
+passed to the loader and that the loader will perform a sort-preserving merge to 
+make sure that the data is globally sorted.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1);
+</source>
+
+<p>You can also write data in Zebra format. Note that, since Zebra requires a 
schema to be stored with the data, the relation that is stored must have a name 
assigned (via alias) to every column in the relation.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1) AS max_val;
+STORE C INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer('');
+</source>
+
+ </section> <!-- END ZEBRA INTEGRATION  -->
+ 
+ 
  
  </body>
  </document>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml 
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml Thu Nov 
12 18:43:45 2009
@@ -32,6 +32,6 @@
   -->
   <tab label="Project" href="http://hadoop.apache.org/pig/"; type="visible" /> 
   <tab label="Wiki" href="http://wiki.apache.org/pig/"; type="visible" /> 
-  <tab label="Pig 0.5.0 Documentation" dir="" type="visible" /> 
+  <tab label="Pig 0.6.0 Documentation" dir="" type="visible" /> 
 
 </tabs>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Thu Nov 
12 18:43:45 2009
@@ -866,8 +866,89 @@
 </section>
 
 <section>
-<title>Advanced Topics</title>
+<title>Accumulate Interface</title>
+
+<p>In Pig, memory problems can occur when data that results from 
+a group or cogroup operation needs to be placed in a bag and passed in its 
+entirety to a UDF.</p>
+
+<p>This problem is partially addressed by Algebraic UDFs that use the combiner 
+and can deal with data being passed to them incrementally during different 
+processing phases (map, combiner, and reduce). However, there are a number of 
+UDFs that are not Algebraic and don't use the combiner, but still don't need to 
+be given all the data at once.</p>
+
+<p>The new Accumulator interface is designed to decrease memory usage by 
targeting such UDFs. For the functions that implement this interface, Pig 
guarantees that the data for the same key is passed continuously but in small 
increments. To work with incremental data, here is the interface a UDF needs to 
implement:</p>
+<source>
+public interface Accumulator &lt;T&gt; {
+   /**
+    * Process tuples. Each DataBag may contain 0 to many tuples for current key
+    */
+    public void accumulate(Tuple b) throws IOException;
+    /**
+     * Called when all tuples from current key have been passed to accumulate.
+     * @return the value for the UDF for this key.
+     */
+    public T getValue();
+    /**
+     * Called after getValue() to prepare processing for next key. 
+     */
+    public void cleanup();
+}
+</source>
+
+<p>There are several things to note here:</p>
+
+<ol>
+       <li>Each UDF must extend the EvalFunc class and implement all necessary 
functions there.</li>
+       <li>If a function is algebraic but can be used in a FOREACH statement 
with accumulator functions, it needs to implement the Accumulator interface in 
addition to the Algebraic interface.</li>
+       <li>The interface is parameterized with the return type of the 
function.</li>
+       <li>The accumulate function is guaranteed to be called one or more 
+       times, passing one or more tuples in a bag, to the UDF. (Note that the tuple 
+       that is passed to the accumulator has the same content as the one passed to 
+       exec: all the parameters passed to the UDF, one of which should be a 
+       bag.)</li>
+       <li>The getValue function is called after all the tuples for a 
particular key have been processed to retrieve the final value.</li>
+       <li>The cleanup function is called after getValue but before the next 
value is processed.</li>
+</ol>
+
+
+<p>Here is a code snippet of the integer version of the MAX function that implements the interface:</p>
+<source>
+public class IntMax extends EvalFunc&lt;Integer&gt; implements Algebraic, 
Accumulator&lt;Integer&gt; {
+    …….
+    /* Accumulator interface */
+    
+    private Integer intermediateMax = null;
+    
+    @Override
+    public void accumulate(Tuple b) throws IOException {
+        try {
+            Integer curMax = max(b);
+            if (curMax == null) {
+                return;
+            }
+            /* if bag is not null, initialize intermediateMax to negative infinity */
+            if (intermediateMax == null) {
+                intermediateMax = Integer.MIN_VALUE;
+            }
+            intermediateMax = java.lang.Math.max(intermediateMax, curMax);
+        } catch (ExecException ee) {
+            throw ee;
+        } catch (Exception e) {
+            int errCode = 2106;
+            String msg = "Error while computing max in " + this.getClass().getSimpleName();
+            throw new ExecException(msg, errCode, PigException.BUG, e);
+        }
+    }
+
+    @Override
+    public void cleanup() {
+        intermediateMax = null;
+    }
+
+    @Override
+    public Integer getValue() {
+        return intermediateMax;
+    }
+}
+</source>
+
+</section>
+
 
+<section>
+<title>Advanced Topics</title>
 
 <section>
 <title>Function Instantiation</title>

