svn commit: r887019 - in /hadoop/pig/branches/branch-0.6: ./ src/docs/src/documentation/content/xdocs/

olga Thu, 03 Dec 2009 16:52:16 -0800

Author: olga
Date: Fri Dec  4 00:51:50 2009
New Revision: 887019

URL: http://svn.apache.org/viewvc?rev=887019&view=rev
Log:
PIG-1084: Pig 0.6.0 Documentation improvements  (chandec via olgan)


Modified:
    hadoop/pig/branches/branch-0.6/CHANGES.txt
    
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml
    
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
    
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
    
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml

Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=887019&r1=887018&r2=887019&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.6/CHANGES.txt Fri Dec  4 00:51:50 2009
@@ -24,6 +24,8 @@
 
 IMPROVEMENTS
 
+PIG-1084: Pig 0.6.0 Documentation improvements  (chandec via olgan)
+
 PIG-978: MQ docs update (chandec via olgan)
 
 PIG-872: use distributed cache for the replicated data set in FR join

Modified: 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=887019&r1=887018&r2=887019&view=diff
==============================================================================
--- 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml
 (original)
+++ 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml
 Fri Dec  4 00:51:50 2009
@@ -64,7 +64,7 @@
 <section>
 <title>Project Early and Often </title>
 
-<p>Pig does not (yet) determine when a field is no longer needed and drop the 
field from the row.  For example, say you have a query like: </p>
+<p>Pig does not (yet) determine when a field is no longer needed and drop the 
field from the row. For example, say you have a query like: </p>
 
 <source>
 A = load 'myfile' as (t, u, v);
@@ -74,7 +74,7 @@
 E = foreach D generate group, COUNT($1);
 </source>
 
-<p>There is no need for v, y, or z to participate in this query.  And there is 
no need to carry both t and x past the join, just one will suffice.  Changing 
the above query to  </p>
+<p>There is no need for v, y, or z to participate in this query.  And there is 
no need to carry both t and x past the join, just one will suffice. Changing 
the query above to the query below will greatly reduce the amount of data being 
carried through the map and reduce phases by pig. </p>
 
 <source>
 A = load 'myfile' as (t, u, v);
@@ -87,8 +87,7 @@
 E = foreach D generate group, COUNT($1);
 </source>
 
-<p>will greatly reduce the amount of data being carried through the map and 
reduce phases by pig.  Depending on your data, this can produce significant 
time savings.  In queries similar to the example given we have seen total time 
drop by 50%. </p>
-
+<p>Depending on your data, this can produce significant time savings. In 
queries similar to the example shown here we have seen total time drop by 
50%.</p>
 </section>
 
 <section>
@@ -147,12 +146,9 @@
 </section>
 
 <section>
-<title>Make your UDFs Algebraic</title>
+<title>Make Your UDFs Algebraic</title>
 
-<p>Queries that can take advantage of the combiner generally ran much faster 
(sometimes several times faster) than the versions that don't. 
-The latest code significantly improves combiner usage; however, you need to 
make sure you do your part. 
-If you have a UDF that works on grouped data and is, by nature, algebraic 
(meaning their computation can be decomposed into multiple steps) 
-make sure you implement it as such. For details on how to write algebraic 
UDFs, see the <a href="udf.html">Pig UDF Manual</a>. </p>
+<p>Queries that can take advantage of the combiner generally ran much faster 
(sometimes several times faster) than the versions that don't. The latest code 
significantly improves combiner usage; however, you need to make sure you do 
your part. If you have a UDF that works on grouped data and is, by nature, 
algebraic (meaning their computation can be decomposed into multiple steps) 
make sure you implement it as such. For details on how to write algebraic UDFs, 
see the Pig UDF Manual and <a href="udf.html#Aggregate+Functions">Aggregate 
Functions</a>.</p>
 
 <source>
 A = load 'data' as (x, y, z)
@@ -162,12 +158,18 @@
 </source>
 
 <p>If <code>MyUDF</code> is algebraic, the query will use combiner and run 
much faster. You can run <code>explain</code> command on your query to make 
sure that combiner is used. </p>
+</section>
 
+<section>
+<title>Implement the Aggregator Interface</title>
+<p>
+If your UDF can't be made Algebraic but is able to deal with getting input in 
chunks rather than all at once, consider implementing the Aggregator interface 
to reduce the amount of memory used by your script. If your function 
<em>is</em> Algebraic and can be used on conjunction with Accumulator 
functions, you will need to implement the Accumulator interface as well as the 
Algebraic interface. For more information, see the Pig UDF Manual and <a 
href="udf.html#Accumulator+Interface">Accumulator Interface</a>.
+</p>
 </section>
 
+
 <section>
 <title>Drop Nulls Before a Join</title>
-<p>This comment only applies to pig 0.2.0 branch, as pig 0.1.0 does not have 
nulls. </p>
 <p>With the introduction of nulls, join and cogroup semantics were altered to 
work with nulls.  The semantic for cogrouping with nulls is that nulls from a 
given input are grouped together, but nulls across inputs are not grouped 
together.  This preserves the semantics of grouping (nulls are collected 
together from a single input to be passed to aggregate functions like COUNT) 
and the semantics of join (nulls are not joined across inputs).  Since 
flattening an empty bag results in an empty row, in a standard join the rows 
with a null key will always be dropped.  The join:  </p>
 
 <source>
@@ -199,10 +201,13 @@
 </section>
 
 <section>
-<title>Take Advantage of Join Optimization</title>
+<title>Take Advantage of Join Optimizations</title>
 
-<p>The optimization insures that the last table in the join is not brought 
into memory but stream through instead. The optimization reduces the amount of 
memory used which means you can avoid spilling the data and also should be able 
to scale your query to larger data volumes. </p>
-<p>To take advantage of this optimization, make sure that the table with the 
largest number of tuples per key is the last table in your query. </p>
+<section>
+<title>Regular Join Optimizations</title>
+<p>Optimization for regular joins ensures that the last table in the join is 
not brought into memory but streamed through instead. Optimization reduces the 
amount of memory used which means you can avoid spilling the data and also 
should be able to scale your query to larger data volumes. </p>
+<p>To take advantage of this optimization, make sure that the table with the 
largest number of tuples per key is the last table in your query. 
+In some of our tests we saw 10x performance improvement as the result of this 
optimization.</p>
 
 <source>
 small = load 'small_file' as (t, u, v);
@@ -210,34 +215,36 @@
 C = join small by t, large by x;
 </source>
 
-<p>In some of our tests we saw 10x performance improvement as the result of 
this optimization. </p>
 </section>
 
 <section>
-<title>Use Fragment Replicate Join</title>
+<title>Specialized Join Optimizations</title>
+<p>Optimization can also be achieved using fragment replicate joins, skewed 
joins, and merge joins. 
+For more information see <a 
href="piglatin_users.html#Specialized+Joins">Specialized Joins</a>.</p>
+</section>
 
-<p>This type of join works well if one of tables is small enough to fit into 
main memory. In this case, Pig can perform a very efficient join since, in 
hadoop world, it can be done completely on the map side. </p>
+</section>
 
-<source>
-tiny = load 'small_file' as (t, u, v);
-large = load 'large_file' as (x, y, z);
-C = join large by x, tiny by t using "replicated";
-</source>
 
-<p>Note that the large table must come first followed by one or more small 
tables. All small tables together must fit into main memory. </p>
-<p>This feature is new and experimental. It is experimental because we don't 
have a strong sense of how small the small table must be to fit into memory. In 
our tests with a simple query that involved just join a table of up to 100M can 
be used if the process overall gets 1 GB of memory. If the table does not fit 
into memory, the process would fail and generate an error. </p>
+<section>
+<title>Use the PARALLEL Keyword</title>
 
-</section>
+<p>PARALLEL controls the number of reducers invoked by Hadoop. The default 
value is 1. However, the number of reducers you need for a particular construct 
in Pig that forms a MapReduce boundary depends entirely on (1) your data and 
the number of intermediate keys you are generating in your mappers  and (2) the 
partitioner and distribution of map (combiner) output keys. In the best cases 
we have seen that a reducer processing about 500 MB of data behaves 
efficiently.</p>
 
-<section>
-<title>Use PARALLEL Keyword</title>
+<p>The keyword makes sense with any operator that starts a reduce phase. This 
includes  
+<a href="piglatin_reference.html#COGROUP">COGROUP</a>, 
+<a href="piglatin_reference.html#CROSS">CROSS</a>, 
+<a href="piglatin_reference.html#DISTINCT">DISTINCT</a>, 
+<a href="piglatin_reference.html#GROUP">GROUP</a>, 
+<a href="piglatin_reference.html#JOIN">JOIN</a>, 
+<a href="piglatin_reference.html#ORDER">ORDER</a>, and 
+<a href="piglatin_reference.html#JOIN%2C+OUTER">OUTER JOIN</a>.
 
-<p>PARALLEL controls the number of reducers invoked by Hadoop. The default out 
of the box is 1 which, in most cases, is not what you want. I reasonable 
heuristic to use is something like  </p>
+</p>
 
-<source>&lt;num machines&gt; * &lt;num reduce slots per machine&gt; * 
0.9</source>
+<p>You can set the value of PARALLEL in your scripts in conjunction with the 
operator (see the example below). You can also set the value of PARALLEL for 
all scripts using the <a href="piglatin_reference.html#set">set</a> command.</p>
 
-<p>The keyword makes sense on any operator that starts a reduce phase. This 
includes GROUP, COGROUP, JOIN, DISTINCT, LIMIT, ORDER BY. </p>
-<p>Example: </p>
+<p>Example</p>
 
 <source>
 A = load 'myfile' as (t, u, v);
@@ -248,7 +255,7 @@
 </section>
 
 <section>
-<title>Use LIMIT</title>
+<title>Use the LIMIT Operator</title>
 
 <p>A lot of the times, you are not interested in the entire output but either 
a sample or top results. In those cases, using LIMIT can yeild a much better 
performance as we push the limit as high as possible to minimize the amount of 
data travelling through the pipeline. </p>
 <p>Sample: 

Modified: 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=887019&r1=887018&r2=887019&view=diff
==============================================================================
--- 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
 (original)
+++ 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
 Fri Dec  4 00:51:50 2009
@@ -5451,7 +5451,8 @@
                <para>expression</para>
             </entry>
             <entry>
-               <para>A tuple expression. This is the group key or key field. 
If the result of the tuple expression is a single field, the key will be the 
value of the first field rather than a tuple with one field.</para>
+               <para>A tuple expression. This is the group key or key field. 
If the result of the tuple expression is a single field, the key will be the 
value of the first field rather than a tuple with one field. To group using 
multiple keys, enclose the keys in parentheses:</para>
+               <para>B = GROUP A BY (key1,key2);</para>
             </entry>
          </row>
          
@@ -5657,6 +5658,17 @@
    
    <section>
    <title>Example</title>
+<para>This example shows to group using multiple keys.</para>   
+<programlisting>
+ A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int, 
date:chararray, result:chararray, tsid:int, tag:chararray);
+ B = GROUP A BY (tcid, tpid); 
+</programlisting>
+    </section>   
+   
+   
+   
+   <section>
+   <title>Example</title>
 <para>This example shows a map-side group.</para>   
 <programlisting>
  register zebra.jar;
@@ -5772,7 +5784,9 @@
    <section>
    <title>Usage</title>
    <para>Use the JOIN operator to perform an inner, equijoin join of two or 
more relations based on common field values. 
-   The JOIN operator always performs an inner join. Note that the JOIN and 
COGROUP operators perform similar functions. 
+   The JOIN operator always performs an inner join. Inner joins ignore null 
keys, so it makes sense to filter them out before the join.</para>
+   
+   <para>Note that the JOIN and COGROUP operators perform similar functions. 
    JOIN creates a flat set of output records while COGROUP creates a nested 
set of output records.</para>
     </section>
     
@@ -9784,6 +9798,17 @@
             <entry>
                <para>sets user-specified name for the job </para>
             </entry>
+            </row>
+            <row>
+            <entry>
+               <para>default_parallel</para>
+            </entry>
+            <entry>
+               <para>whole number </para>
+            </entry>
+            <entry>
+               <para>sets the number of reducers for all MapReduce jobs 
generated by Pig</para>
+            </entry>
          </row></tbody></tgroup>
    </informaltable>
    <para/>
@@ -9795,6 +9820,7 @@
 <programlisting>
 grunt&gt; set debug on
 grunt&gt; set job.name 'my job'
+grunt&gt; set default_parallel 100
 </programlisting>
    </section></section>
    

Modified: 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=887019&r1=887018&r2=887019&view=diff
==============================================================================
--- 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
 (original)
+++ 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
 Fri Dec  4 00:51:50 2009
@@ -480,7 +480,7 @@
 <p>
 Pig Latin includes three "specialized" joins: fragement replicate joins, 
skewed joins, and merge joins. 
 Replicate, skewed, and merge joins can be performed using the <a 
href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
-Replicate and skewed joins can be performed using the the the <a 
href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax.
+Replicate and skewed joins can also be performed using the the the <a 
href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax.
 </p>
 
 <!-- FRAGMENT REPLICATE JOINS-->
@@ -755,77 +755,6 @@
 
 </section> <!-- END MEMORY MANAGEMENT  -->
 
- <!-- ZEBRA INTEGRATION -->
-<section>
-<title>Integration with Zebra</title>
- <p>This version of Pig is integrated with Zebra storage format. Zebra is a 
recent contrib project of Pig and the details can be found at 
http://wiki.apache.org/pig/zebra. Pig can now: </p>
- <ul>
- <li>Load data in Zebra format</li>
-  <li>Take advantage of sorted Zebra tables in case of map-side group and 
merge join.</li>
-  <li>Store data in Zebra format</li>
- </ul>
- <p></p>
- <p>To load data in Zebra format using TableLoader, do the following:</p>
- <source>
-register /zebra.jar;
-A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
-B = FOREACH A GENERATE name, age, gpa;
-</source>
-  
- <p>There are a couple of things to note:</p>
- <ol>
- <li>You need to register a Aebra jar file the same way you would do it for 
any other UDF.</li>
- <li>You need to place the jar on your classpath.</li>
- <li>Zebra data is self-described and always contains schema. This means that 
the AS clause is unnecessary as long as 
-  you know what the column names and types are. To determine the column names 
and types, you can run the DESCRIBE statement right after the load:
- <source>
-A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
-DESCRIBE A;
-a: {name: chararray,age: int,gpa: float}
-</source>
- </li>
- </ol>
-   
-<p>You can provide alternative names to the columns with AS clause. You can 
also provide types as long as the 
- original type can be converted to the new type.</p>
- 
-<p>You can provide multiple, comma-separated files to the loader:</p>
-<source>
-A = LOAD 'studenttab, votertab' USING 
org.apache.hadoop.zebra.pig.TableLoader();
-</source>
-
-<p>TableLoader supports efficient column selection. The current version of Pig 
does not support automatically pushing 
- projections down to the loader. (The work is in progress and will be done 
after beta.) 
- Meanwhile, the loader allows passing columns down via a list of arguments. 
This example tells the loader to only return two columns, name and age.</p>
-<source>
-A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader('name, 
age');
-</source>
-
-<p>If the input data is globally sorted, map-side group or merge join can be 
used. Please, notice the “sorted” argument passed to the loader. This lets 
the loader know that the data is expected to be globally sorted and that a 
single key must be given to the same map.</p>
-
-<p>Here is an example of the merge join. Note that the first argument to the 
loader is left empty to indicate that all columns are requested.</p>
-<source>
-A = LOAD'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 
'sorted');
-B = LOAD 'votersortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 
'sorted');
-G = JOIN A BY $0, B By $0 USING "merge";
-</source>
-
-<p>Here is an example of a map-side group. Note that multiple sorted files are 
passed to the loader and that the loader will perform sort preserving merge to 
make sure that the data is globally sorted.</p>
-<source>
-A = LOAD 'studentsortedtab, studentnullsortedtab' using 
org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 
'sorted');
-B = GROUP A BY $0 USING "collected";
-C = FOREACH B GENERATE group, MAX(a.$1);
-</source>
-
-<p>You can also write data in Zebra format. Note that, since Zebra requires a 
schema to be stored with the data, the relation that is stored must have a name 
assigned (via alias) to every column in the relation.</p>
-<source>
-A = LOAD 'studentsortedtab, studentnullsortedtab' USING 
org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 
'sorted');
-B = GROUP A BY $0 USING "collected";
-C = FOREACH B GENERATE group, MAX(a.$1) AS max_val;
-STORE C INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer('');
-</source>
-
- </section> <!-- END ZEBRA INTEGRATION  -->
  
  
  

Modified: 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml?rev=887019&r1=887018&r2=887019&view=diff
==============================================================================
--- 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml 
(original)
+++ 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml 
Fri Dec  4 00:51:50 2009
@@ -342,7 +342,7 @@
 </tr>
 </table>
 
-<p>All Pig-specific classes are available <a 
href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/"> 
here</a> </p>
+<p>All Pig-specific classes are available <a 
href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/"> 
here</a>. </p>
 <p><code>Tuple</code> and <code>DataBag</code> are different in that they are 
not concrete classes but rather interfaces. This enables users to extend Pig 
with their own versions of tuples and bags. As a result, UDFs cannot directly 
instantiate bags or tuples; they need to go through factory classes: 
<code>TupleFactory</code> and <code>BagFactory</code>. </p>
 <p>The builtin <code>TOKENIZE</code> function shows how bags and tuples are 
created. A function takes a text string as input and returns a bag of words 
from the text. (Note that currently Pig bags always contain tuples.) </p>
 
@@ -866,7 +866,7 @@
 </section>
 
 <section>
-<title>Accumulate Interface</title>
+<title>Accumulator Interface</title>
 
 <p>In Pig, problems with memory usage can occur when data, which results from 
a group or cogroup operation, needs to be placed in a bag  and passed in its 
entirety to a UDF.</p>
 
@@ -880,7 +880,7 @@
     */
     public void accumulate(Tuple b) throws IOException;
     /**
-     * Called when all tuples from current key have been passed to accumulate.
+     * Called when all tuples from current key have been passed to the 
accumulator.
      * @return the value for the UDF for this key.
      */
     public T getValue();
