Author: olga
Date: Thu Nov 12 18:43:45 2009
New Revision: 835496
URL: http://svn.apache.org/viewvc?rev=835496&view=rev
Log:
PIG-1089: Pig 0.6.0 Documentation (chandec via olgan)
Modified:
hadoop/pig/trunk/CHANGES.txt
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
Modified: hadoop/pig/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Thu Nov 12 18:43:45 2009
@@ -26,6 +26,8 @@
IMPROVEMENTS
+PIG-1089: Pig 0.6.0 Documentation (chandec via olgan)
+
PIG-958: Splitting output data on key field (ankur via pradeepkth)
PIG-1058: FINDBUGS: remaining "Correctness Warnings" (olgan)
Modified:
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
---
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
(original)
+++
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
Thu Nov 12 18:43:45 2009
@@ -5412,7 +5412,7 @@
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>alias = GROUP alias { ALL | BY expression} [, alias ALL
| BY expression ...] [PARALLEL n];</para>
+ <para>alias = GROUP alias { ALL | BY expression} [, alias ALL
| BY expression ...] [USING "collected"] [PARALLEL n];</para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -5454,6 +5454,27 @@
<para>A tuple expression. This is the group key or key field.
If the result of the tuple expression is a single field, the key will be the
value of the first field rather than a tuple with one field.</para>
</entry>
</row>
+
+ <row>
+ <entry>
+ <para>USING</para>
+ </entry>
+ <entry>
+ <para>Keyword</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>"collected"</para>
+ </entry>
+ <entry>
+ <para>Allows for more efficient computation of a group if the
loader guarantees that the data for the
+ same key is continuous and is given to a single map. As of this
release, only the Zebra loader makes this
+ guarantee. The efficiency is achieved by performing the group
operation in the map phase
+ rather than the reduce phase (see <ulink
url="piglatin_users.html#Integration+with+Zebra">Integration with
Zebra</ulink>). This feature cannot be used with the COGROUP operator.</para>
+ </entry>
+ </row>
+
<row>
<entry>
@@ -5553,6 +5574,10 @@
(19,{(Mary)})
(20,{(Bill)})
</programlisting>
+
+ </section>
+ <section>
+ <title>Example</title>
<para>Suppose we have relation A.</para>
<programlisting>
@@ -5629,6 +5654,18 @@
</programlisting>
</section>
+
+ <section>
+ <title>Example</title>
+<para>This example shows a map-side group.</para>
+<programlisting>
+ register zebra.jar;
+ A = LOAD 'studentsortedtab' USING
org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa', 'sorted');
+ B = GROUP A BY name USING "collected";
+ C = FOREACH B GENERATE group, MAX(A.age), COUNT_STAR(A);
+</programlisting>
+ </section>
+
</section>
<section>
@@ -5793,7 +5830,8 @@
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>alias = JOIN left-alias BY left-alias-column
[LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [PARALLEL n];
</para>
+ <para>alias = JOIN left-alias BY left-alias-column
[LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column
+ [USING "replicated" | "skewed"] [PARALLEL n];</para>
</entry>
</row></tbody></tgroup>
</informaltable>
@@ -5865,6 +5903,34 @@
</entry>
</row>
+ <row>
+ <entry>
+ <para>USING</para>
+ </entry>
+ <entry>
+ <para>Keyword</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>"replicated"</para>
+ </entry>
+ <entry>
+ <para>Use to perform fragment replicate joins (see <ulink
url="piglatin_users.html#Fragment+Replicate+Joins">Fragment Replicate
Joins</ulink>).</para>
+ <para>Only left outer join is supported for replicated outer
join.</para>
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <para>"skewed"</para>
+ </entry>
+ <entry>
+ <para>Use to perform skewed joins (see <ulink
url="piglatin_users.html#Skewed+Joins">Skewed Joins</ulink>).</para>
+ </entry>
+ </row>
+
+
<row>
<entry>
<para>PARALLEL n</para>
@@ -5929,6 +5995,21 @@
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 FULL, B BY $0;
</programlisting>
+
+<para>This example shows a replicated left outer join.</para>
+<programlisting>
+A = LOAD 'large';
+B = LOAD 'tiny';
+C = JOIN A BY $0 LEFT, B BY $0 USING "replicated";
+</programlisting>
+
+<para>This example shows a skewed full outer join.</para>
+<programlisting>
+A = LOAD 'studenttab' AS (name, age, gpa);
+B = LOAD 'votertab' AS (name, age, registration, contribution);
+C = JOIN A BY name FULL, B BY name USING "skewed";
+</programlisting>
+
</section>
</section>
@@ -8739,12 +8820,78 @@
</programlisting>
</section></section></section>
+ <!-- Shell COMMANDS-->
+ <section>
+ <title>Shell Commands</title>
+
+ <section>
+ <title>fs</title>
+ <para>Invokes any FSShell command from within a Pig script or the Grunt
shell.</para>
+
+ <section>
+ <title>Syntax </title>
+ <informaltable frame="all">
+ <tgroup cols="1"><tbody><row>
+ <entry>
+ <para>fs subcommand subcommand_parameters </para>
+ </entry>
+ </row></tbody></tgroup>
+ </informaltable></section>
+
+ <section>
+ <title>Terms</title>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <tbody>
+ <row>
+ <entry>
+ <para>subcommand</para>
+ </entry>
+ <entry>
+ <para>The FSShell command.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>subcommand_parameters</para>
+ </entry>
+ <entry>
+ <para>The FSShell command parameters.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ </section>
+ <section>
+ <title>Usage</title>
+ <para>Use the fs command to invoke any FSShell command from within a Pig
script or Grunt shell.
+ The fs command greatly extends the set of supported file system commands
and the capabilities
+ of existing commands such as ls, which now supports globbing.
For a complete list of
+ FSShell commands, see the
+ <ulink
url="http://hadoop.apache.org/common/docs/current/hdfs_shell.html">HDFS File
System Shell Guide</ulink>.</para>
+ </section>
+
+ <section>
+ <title>Examples</title>
+ <para>In these examples, a directory is created, a file is copied, and a
file is listed.</para>
+<programlisting>
+fs -mkdir /tmp
+fs -copyFromLocal file-x file-y
+fs -ls file-y
+</programlisting>
+ </section>
+ </section>
+ </section>
+
<!-- FILE COMMANDS-->
<section>
<title>File Commands</title>
-
+ <para>Note: Beginning with Pig 0.6.0, the file commands are deprecated
and will be removed in a future release.
+ Use Pig's fs command instead (see <ulink
url="piglatin_reference.html#Shell+Commands">Shell Commands</ulink>).
+ </para>
<section>
<title>cat</title>
<para>Prints the content of one or more files to the screen.</para>
@@ -8786,7 +8933,8 @@
john adams
anne white
</programlisting>
- </section></section>
+ </section>
+ </section>
<section>
<title>cd</title>
@@ -8984,96 +9132,7 @@
</programlisting>
</section></section>
- <section>
- <title>exec</title>
- <para>Run a Pig script.</para>
-
- <section>
- <title>Syntax</title>
- <informaltable frame="all">
- <tgroup cols="1"><tbody><row>
- <entry>
- <para>exec [-param param_name = param_value] [-param_file
file_name] script </para>
- </entry>
- </row></tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Terms</title>
- <informaltable frame="all">
- <tgroup cols="2"><tbody>
- <row>
- <entry>
- <para>-param param_name = param_value</para>
- </entry>
- <entry>
- <para>See Parameter Substitution.</para>
- </entry>
- </row>
-
- <row>
- <entry>
- <para>-param_file file_name</para>
- </entry>
- <entry>
- <para>See Parameter Substitution. </para>
- </entry>
- </row>
-
- <row>
- <entry>
- <para>script</para>
- </entry>
- <entry>
- <para>The name of a Pig script.</para>
- </entry>
- </row>
-
- </tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Usage</title>
- <para>Use the exec command to run a Pig script with no interaction between
the script and the Grunt shell (batch mode). Aliases defined in the script are
not available to the shell; however, the files produced as the output of the
script and stored on the system are visible after the script is run. Aliases
defined via the shell are not available to the script. </para>
- <para>With the exec command, store statements will not trigger execution;
rather, the entire script is parsed before execution starts. Unlike the run
command, exec does not change the command history or remember the handles used
inside the script. Exec without any parameters can be used in scripts to force
execution up to the point in the script where the exec occurs. </para>
- <para>For comparison, see the run command. Both the exec and run commands
are useful for debugging because you can modify a Pig script in an editor and
then rerun the script in the Grunt shell without leaving the shell. Also, both
commands promote Pig script modularity as they allow you to reuse existing
components.</para>
- </section>
-
- <section>
- <title>Examples</title>
- <para>In this example the script is displayed and run.</para>
-
-<programlisting>
-grunt> cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = LIMIT a 3;
-DUMP b;
-
-grunt> exec myscript.pig
-(alice,20,2.47)
-(luke,18,4.00)
-(holly,24,3.27)
-</programlisting>
-
- <para>In this example parameter substitution is used with the exec
command.</para>
-<programlisting>
-grunt> cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = ORDER a BY name;
-
-STORE b into '$out';
-
-grunt> exec -param out=myoutput myscript.pig
-</programlisting>
-
- <para>In this example multiple parameters are specified.</para>
-<programlisting>
-grunt> exec -param p1=myparam1 -param p2=myparam2 myscript.pig
-</programlisting>
-
- </section>
-
- </section>
+
<section>
<title>ls</title>
@@ -9343,8 +9402,15 @@
</programlisting>
</section></section>
+
+ </section>
+
+
<section>
- <title>run</title>
+ <title>Utility Commands</title>
+
+ <section>
+ <title>exec</title>
<para>Run a Pig script.</para>
<section>
@@ -9352,7 +9418,7 @@
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>run [-param param_name = param_value] [-param_file
file_name] script </para>
+ <para>exec [-param param_name = param_value] [-param_file
file_name] script </para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -9361,23 +9427,24 @@
<title>Terms</title>
<informaltable frame="all">
<tgroup cols="2"><tbody>
- <row>
+ <row>
<entry>
<para>-param param_name = param_value</para>
</entry>
<entry>
<para>See Parameter Substitution.</para>
</entry>
- </row>
+ </row>
- <row>
+ <row>
<entry>
<para>-param_file file_name</para>
</entry>
<entry>
<para>See Parameter Substitution. </para>
</entry>
- </row>
+ </row>
+
<row>
<entry>
<para>script</para>
@@ -9392,49 +9459,47 @@
<section>
<title>Usage</title>
- <para>Use the run command to run a Pig script that can interact with the
Grunt shell (interactive mode). The script has access to aliases defined
externally via the Grunt shell. The Grunt shell has access to aliases defined
within the script. All commands from the script are visible in the command
history. </para>
- <para>With the run command, every store triggers execution. The
statements from the script are put into the command history and all the aliases
defined in the script can be referenced in subsequent statements after the run
command has completed. Issuing a run command on the grunt command line has
basically the same effect as typing the statements manually. </para>
- <para>For comparison, see the exec command. Both the run and exec commands
are useful for debugging because you can modify a Pig script in an editor and
then rerun the script in the Grunt shell without leaving the shell. Also, both
commands promote Pig script modularity as they allow you to reuse existing
components.</para>
- </section>
+ <para>Use the exec command to run a Pig script with no interaction between
the script and the Grunt shell (batch mode). Aliases defined in the script are
not available to the shell; however, the files produced as the output of the
script and stored on the system are visible after the script is run. Aliases
defined via the shell are not available to the script. </para>
+ <para>With the exec command, store statements will not trigger execution;
rather, the entire script is parsed before execution starts. Unlike the run
command, exec does not change the command history or remember the handles used
inside the script. Exec without any parameters can be used in scripts to force
execution up to the point in the script where the exec occurs. </para>
+ <para>For comparison, see the run command. Both the exec and run commands
are useful for debugging because you can modify a Pig script in an editor and
then rerun the script in the Grunt shell without leaving the shell. Also, both
commands promote Pig script modularity as they allow you to reuse existing
components.</para>
+ </section>
<section>
- <title>Example</title>
- <para>In this example the script interacts with the results of commands
issued via the Grunt shell.</para>
+ <title>Examples</title>
+ <para>In this example the script is displayed and run.</para>
+
<programlisting>
grunt> cat myscript.pig
-b = ORDER a BY name;
-c = LIMIT b 10;
-
-grunt> a = LOAD 'student' AS (name, age, gpa);
-
-grunt> run myscript.pig
-
-grunt> d = LIMIT c 3;
+a = LOAD 'student' AS (name, age, gpa);
+b = LIMIT a 3;
+DUMP b;
-grunt> DUMP d;
+grunt> exec myscript.pig
(alice,20,2.47)
-(alice,27,1.95)
-(alice,36,2.27)
+(luke,18,4.00)
+(holly,24,3.27)
</programlisting>
-
-
- <para>In this example parameter substitution is used with the run
command.</para>
-<programlisting>
-grunt> a = LOAD 'student' AS (name, age, gpa);
+ <para>In this example parameter substitution is used with the exec
command.</para>
+<programlisting>
grunt> cat myscript.pig
+a = LOAD 'student' AS (name, age, gpa);
b = ORDER a BY name;
+
STORE b into '$out';
-grunt> run -param out=myoutput myscript.pig
+grunt> exec -param out=myoutput myscript.pig
</programlisting>
-
- </section></section>
+
+ <para>In this example multiple parameters are specified.</para>
+<programlisting>
+grunt> exec -param p1=myparam1 -param p2=myparam2 myscript.pig
+</programlisting>
+
</section>
+ </section>
- <section>
- <title>Utility Commands</title>
<section>
<title>help</title>
@@ -9557,6 +9622,97 @@
</programlisting>
</section></section>
+
+ <section>
+ <title>run</title>
+ <para>Run a Pig script.</para>
+
+ <section>
+ <title>Syntax</title>
+ <informaltable frame="all">
+ <tgroup cols="1"><tbody><row>
+ <entry>
+ <para>run [-param param_name = param_value] [-param_file
file_name] script </para>
+ </entry>
+ </row></tbody></tgroup>
+ </informaltable></section>
+
+ <section>
+ <title>Terms</title>
+ <informaltable frame="all">
+ <tgroup cols="2"><tbody>
+ <row>
+ <entry>
+ <para>-param param_name = param_value</para>
+ </entry>
+ <entry>
+ <para>See Parameter Substitution.</para>
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <para>-param_file file_name</para>
+ </entry>
+ <entry>
+ <para>See Parameter Substitution. </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>script</para>
+ </entry>
+ <entry>
+ <para>The name of a Pig script.</para>
+ </entry>
+ </row>
+
+ </tbody></tgroup>
+ </informaltable></section>
+
+ <section>
+ <title>Usage</title>
+ <para>Use the run command to run a Pig script that can interact with the
Grunt shell (interactive mode). The script has access to aliases defined
externally via the Grunt shell. The Grunt shell has access to aliases defined
within the script. All commands from the script are visible in the command
history. </para>
+ <para>With the run command, every store triggers execution. The
statements from the script are put into the command history and all the aliases
defined in the script can be referenced in subsequent statements after the run
command has completed. Issuing a run command on the grunt command line has
basically the same effect as typing the statements manually. </para>
+ <para>For comparison, see the exec command. Both the run and exec commands
are useful for debugging because you can modify a Pig script in an editor and
then rerun the script in the Grunt shell without leaving the shell. Also, both
commands promote Pig script modularity as they allow you to reuse existing
components.</para>
+ </section>
+
+ <section>
+ <title>Example</title>
+ <para>In this example the script interacts with the results of commands
issued via the Grunt shell.</para>
+<programlisting>
+grunt> cat myscript.pig
+b = ORDER a BY name;
+c = LIMIT b 10;
+
+grunt> a = LOAD 'student' AS (name, age, gpa);
+
+grunt> run myscript.pig
+
+grunt> d = LIMIT c 3;
+
+grunt> DUMP d;
+(alice,20,2.47)
+(alice,27,1.95)
+(alice,36,2.27)
+</programlisting>
+
+
+ <para>In this example parameter substitution is used with the run
command.</para>
+<programlisting>
+grunt> a = LOAD 'student' AS (name, age, gpa);
+
+grunt> cat myscript.pig
+b = ORDER a BY name;
+STORE b into '$out';
+
+grunt> run -param out=myoutput myscript.pig
+</programlisting>
+
+ </section></section>
+
+
+
<section>
<title>set</title>
<para>Assigns values to keys used in Pig.</para>
Modified:
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
---
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
(original)
+++
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
Thu Nov 12 18:43:45 2009
@@ -158,12 +158,12 @@
<title>Increasing Parallelism</title>
<p>To increase the parallelism of a job, include the PARALLEL clause with
the COGROUP, CROSS, DISTINCT, GROUP, JOIN and ORDER operators.
PARALLEL controls the number of reducers only; the number of maps is
determined by the input data
- (see the <a href="http://wiki.apache.org/pig/PigUserCookbook">Pig User
Cookbook</a>).</p>
+ (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
</section>
<section><title>Increasing Performance</title>
<p>You can increase or optimize the performance of your Pig Latin scripts
by following a few simple rules
- (see the <a href="http://wiki.apache.org/pig/PigUserCookbook">Pig User
Cookbook</a>).</p>
+ (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
</section>
<section>
@@ -420,8 +420,8 @@
<title>Specialized Joins</title>
<p>
Pig Latin includes three "specialized" joins: fragment replicate joins,
skewed joins, and merge joins.
-These joins are performed using the <a
href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
-Currently, these joins <strong>cannot</strong> be performed using outer joins.
+Replicate, skewed, and merge joins can be performed using the <a
href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
+Replicate and skewed joins can also be performed using the <a
href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax.
</p>
<!-- FRAGMENT REPLICATE JOINS-->
@@ -434,7 +434,7 @@
<section>
<title>Usage</title>
-<p>Perform a fragment replicate join with the USING clause (see the <a
href="piglatin_reference.html#JOIN">JOIN</a> operator).
+<p>Perform a fragment replicate join with the USING clause (see <a
href="piglatin_reference.html#JOIN">JOIN</a> and <a
href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>).
In this example, a large relation is joined with two smaller relations. Note
that the large relation comes first, followed by the smaller relations.
All small relations together must fit into main memory; otherwise, an
error is generated. </p>
<source>
@@ -478,11 +478,11 @@
<section>
<title>Usage</title>
-<p>Perform a skewed join with the USING clause (see the <a
href="piglatin_reference.html#JOIN">JOIN</a> operator). </p>
+<p>Perform a skewed join with the USING clause (see <a
href="piglatin_reference.html#JOIN">JOIN</a> and <a
href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>). </p>
<source>
big = LOAD 'big_data' AS (b1,b2,b3);
massive = LOAD 'massive_data' AS (m1,m2,m3);
-c = JOIN big BY b1, massive BY m1 USING "skewed";
+C = JOIN big BY b1, massive BY m1 USING "skewed";
</source>
</section>
@@ -683,9 +683,92 @@
D = JOIN C BY $1, B BY $1;
</source>
</section>
+</section> <!-- END OPTIMIZATION RULES -->
+
+ <!-- MEMORY MANAGEMENT -->
+<section>
+<title>Memory Management</title>
+<p>For Pig 0.6.0 we changed how Pig decides when to spill bags to disk. In the
past, Pig tried to detect when an application was getting close to the memory
limit and then spill at that time. However, because Java does not provide an
accurate way to determine when memory is running low, Pig often ran out of
memory. </p>
+<p>In the current version, we allocate a fixed amount of memory to store bags
and spill to disk as soon as the memory limit is reached. This is very similar
to how Hadoop decides when to spill data accumulated by the combiner. </p>
-</section> <!-- END OPTIMIZATION RULES -->
+<p>The amount of memory allocated to bags is determined by
pig.cachedbag.memusage; the default is set to 10% of available memory. Note
that this memory is shared across all large bags used by the application.</p>
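<p>As a minimal sketch, the threshold could be raised through a properties
file (the location and mechanism are assumed to be Pig's standard
pig.properties handling; the 0.2 value is purely illustrative):</p>

```
# conf/pig.properties (assumed location)
# Use 20% of available memory for cached bags instead of the 10% default.
pig.cachedbag.memusage=0.2
```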
+
+</section> <!-- END MEMORY MANAGEMENT -->
+
+ <!-- ZEBRA INTEGRATION -->
+<section>
+<title>Integration with Zebra</title>
+ <p>This version of Pig is integrated with the Zebra storage format. Zebra is
a recent Pig contrib project; details can be found at
http://wiki.apache.org/pig/zebra. Pig can now: </p>
+ <ul>
+ <li>Load data in Zebra format</li>
+ <li>Take advantage of sorted Zebra tables for map-side group and
merge join</li>
+ <li>Store data in Zebra format</li>
+ </ul>
+ <p></p>
+ <p>To load data in Zebra format using TableLoader, do the following:</p>
+ <source>
+register /zebra.jar;
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+B = FOREACH A GENERATE name, age, gpa;
+</source>
+
+ <p>There are a couple of things to note:</p>
+ <ol>
+ <li>You need to register the Zebra jar file the same way you would for
any other UDF.</li>
+ <li>You need to place the jar on your classpath.</li>
+ <li>Zebra data is self-describing and always contains a schema. This means
that the AS clause is unnecessary as long as
+ you know what the column names and types are. To determine the column names
and types, you can run the DESCRIBE statement right after the load:
+ <source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+DESCRIBE A;
+A: {name: chararray,age: int,gpa: float}
+</source>
+ </li>
+ </ol>
+
+<p>You can provide alternative names for the columns with the AS clause. You
can also provide types as long as the
+ original type can be converted to the new type.</p>
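<p>For illustration, here is a hedged sketch of renaming columns with the AS
clause (the alternative column names below are invented for the example; the
types are assumed from the DESCRIBE output shown earlier):</p>

```
A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader()
    AS (student_name: chararray, student_age: int, gpa: float);
```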
+
+<p>You can provide multiple, comma-separated files to the loader:</p>
+<source>
+A = LOAD 'studenttab, votertab' USING
org.apache.hadoop.zebra.pig.TableLoader();
+</source>
+
+<p>TableLoader supports efficient column selection. The current version of Pig
does not support automatically pushing
+ projections down to the loader. (The work is in progress and will be done
after beta.)
+ Meanwhile, the loader allows passing columns down via a list of arguments.
This example tells the loader to return only two columns, name and age.</p>
+<source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader('name,
age');
+</source>
+
+<p>If the input data is globally sorted, map-side group or merge join can be
used. Note the 'sorted' argument passed to the loader. This lets
the loader know that the data is expected to be globally sorted and that all
data for a single key must be given to the same map.</p>
+
+<p>Here is an example of the merge join. Note that the first argument to the
loader is left empty to indicate that all columns are requested.</p>
+<source>
+A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('',
'sorted');
+B = LOAD 'votersortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('',
'sorted');
+G = JOIN A BY $0, B BY $0 USING "merge";
+</source>
+
+<p>Here is an example of a map-side group. Note that multiple sorted files
are passed to the loader and that the loader will perform a sort-preserving
merge to make sure that the data is globally sorted.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING
org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table',
'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1);
+</source>
+
+<p>You can also write data in Zebra format. Note that, since Zebra requires a
schema to be stored with the data, the relation that is stored must have a name
assigned (via alias) to every column in the relation.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING
org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table',
'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1) AS max_val;
+STORE C INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer('');
+</source>
+
+ </section> <!-- END ZEBRA INTEGRATION -->
+
+
</body>
</document>
Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml Thu Nov
12 18:43:45 2009
@@ -32,6 +32,6 @@
-->
<tab label="Project" href="http://hadoop.apache.org/pig/" type="visible" />
<tab label="Wiki" href="http://wiki.apache.org/pig/" type="visible" />
- <tab label="Pig 0.5.0 Documentation" dir="" type="visible" />
+ <tab label="Pig 0.6.0 Documentation" dir="" type="visible" />
</tabs>
Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Thu Nov
12 18:43:45 2009
@@ -866,8 +866,89 @@
</section>
<section>
-<title>Advanced Topics</title>
+<title>Accumulate Interface</title>
+
+<p>In Pig, problems with memory usage can occur when data resulting from
a group or cogroup operation needs to be placed in a bag and passed in its
entirety to a UDF.</p>
+
+<p>This problem is partially addressed by Algebraic UDFs that use the combiner
and can deal with data being passed to them incrementally during different
processing phases (map, combiner, and reduce). However, there are a number of
UDFs that are not Algebraic, don't use the combiner, but still don't need to
be given all data at once. </p>
+
+<p>The new Accumulator interface is designed to decrease memory usage by
targeting such UDFs. For the functions that implement this interface, Pig
guarantees that the data for the same key is passed continuously but in small
increments. To work with incremental data, here is the interface a UDF needs to
implement:</p>
+<source>
+public interface Accumulator <T> {
+ /**
+ * Process tuples. Each DataBag may contain 0 to many tuples for current key
+ */
+ public void accumulate(Tuple b) throws IOException;
+ /**
+ * Called when all tuples from current key have been passed to accumulate.
+ * @return the value for the UDF for this key.
+ */
+ public T getValue();
+ /**
+ * Called after getValue() to prepare processing for next key.
+ */
+ public void cleanup();
+}
+</source>
+
+<p>There are several things to note here:</p>
+
+<ol>
+ <li>Each UDF must extend the EvalFunc class and implement all necessary
functions there.</li>
+ <li>If a function is algebraic but can be used in a FOREACH statement
with accumulator functions, it needs to implement the Accumulator interface in
addition to the Algebraic interface.</li>
+ <li>The interface is parameterized with the return type of the
function.</li>
+ <li>The accumulate function is guaranteed to be called one or more
times, passing one or more tuples in a bag to the UDF. (Note that the tuple
that is passed to the accumulator has the same content as the one passed to
exec: all the parameters passed to the UDF, one of which should be a
bag.)</li>
+ <li>The getValue function is called after all the tuples for a
particular key have been processed to retrieve the final value.</li>
+ <li>The cleanup function is called after getValue but before the next
value is processed.</li>
+</ol>
+
+
+<p>Here is a code snippet of the integer version of the MAX function that
implements the interface:</p>
+<source>
+public class IntMax extends EvalFunc<Integer> implements Algebraic,
Accumulator<Integer> {
+ ...
+ /* Accumulator interface */
+
+ private Integer intermediateMax = null;
+
+ @Override
+ public void accumulate(Tuple b) throws IOException {
+ try {
+ Integer curMax = max(b);
+ if (curMax == null) {
+ return;
+ }
+ /* on the first non-null value, initialize intermediateMax to
negative infinity */
+ if (intermediateMax == null) {
+ intermediateMax = Integer.MIN_VALUE;
+ }
+ intermediateMax = java.lang.Math.max(intermediateMax, curMax);
+ } catch (ExecException ee) {
+ throw ee;
+ } catch (Exception e) {
+ int errCode = 2106;
+ String msg = "Error while computing max in " +
this.getClass().getSimpleName();
+ throw new ExecException(msg, errCode, PigException.BUG, e);
+ }
+ }
+
+ @Override
+ public void cleanup() {
+ intermediateMax = null;
+ }
+
+ @Override
+ public Integer getValue() {
+ return intermediateMax;
+ }
+}
+</source>
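<p>To make the per-key calling sequence concrete, here is a hedged, standalone
sketch that mimics how Pig drives an accumulator: accumulate is called with
incremental batches for one key, getValue returns the result, and cleanup
resets state before the next key. Plain Java collections stand in for Pig's
Tuple and DataBag, and the SimpleAccumulator/SimpleIntMax names are invented
for this illustration:</p>

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for Pig's Accumulator<T>. The real interface
// receives a Tuple whose fields include a DataBag of rows for the key.
interface SimpleAccumulator<T> {
    void accumulate(List<Integer> batch); // one incremental batch for the current key
    T getValue();                         // final value once all batches are in
    void cleanup();                       // reset state before the next key
}

class SimpleIntMax implements SimpleAccumulator<Integer> {
    private Integer intermediateMax = null;

    public void accumulate(List<Integer> batch) {
        for (Integer v : batch) {
            if (v == null) continue;
            // Initialize on first non-null value, then keep the running max.
            intermediateMax = (intermediateMax == null)
                    ? v : Math.max(intermediateMax, v);
        }
    }

    public Integer getValue() { return intermediateMax; }

    public void cleanup() { intermediateMax = null; }
}

public class AccumulatorDemo {
    public static void main(String[] args) {
        SimpleIntMax max = new SimpleIntMax();
        // Key 1 arrives in two increments, as Pig would deliver it.
        max.accumulate(Arrays.asList(3, 7));
        max.accumulate(Arrays.asList(5));
        System.out.println(max.getValue()); // 7
        max.cleanup();
        // Key 2: state was reset, so its max is computed from scratch.
        max.accumulate(Arrays.asList(2));
        System.out.println(max.getValue()); // 2
    }
}
```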
+
+</section>
+
+<section>
+<title>Advanced Topics</title>
<section>
<title>Function Instantiation</title>