Author: olga Date: Fri Sep 3 21:06:37 2010 New Revision: 992468 URL: http://svn.apache.org/viewvc?rev=992468&view=rev Log: PIG-1600: Docs update (chandec via olgan)
Modified: hadoop/pig/branches/branch-0.8/CHANGES.txt hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/cookbook.xml hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/tabs.xml hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/tutorial.xml hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/udf.xml Modified: hadoop/pig/branches/branch-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.8/CHANGES.txt?rev=992468&r1=992467&r2=992468&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.8/CHANGES.txt (original) +++ hadoop/pig/branches/branch-0.8/CHANGES.txt Fri Sep 3 21:06:37 2010 @@ -26,6 +26,8 @@ PIG-1249: Safe-guards against misconfigu IMPROVEMENTS +PIG-1600: Docs update (chandec via olgan) + PIG-1585: Add new properties to help and documentation(olgan) PIG-1399: Filter expression optimizations (yanz via gates) Modified: hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/cookbook.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=992468&r1=992467&r2=992468&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/cookbook.xml (original) +++ hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/cookbook.xml Fri Sep 3 21:06:37 2010 @@ -259,7 +259,7 @@ B = GROUP A BY t PARALLEL 18; <p>In this example all the MapReduce jobs that get launched use 20 reducers.</p> <source> -SET DEFAULT_PARALLEL 20; +SET default_parallel 20; A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v); B = GROUP A BY t; C = 
FOREACH B GENERATE group, COUNT(A.t) as mycount; @@ -292,11 +292,11 @@ C = limit B 500; </section> <section> -<title>Prefer DISTINCT over GROUP BY - GENERATE</title> +<title>Prefer DISTINCT over GROUP BY/GENERATE</title> -<p>When it comes to extracting the unique values from a column in a relation, one of two approaches can be used: </p> +<p>To extract unique values from a column in a relation you can use DISTINCT or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more efficient.</p> -<p>Example Using GROUP BY - GENERATE</p> +<p>Example using GROUP BY - GENERATE:</p> <source> A = load 'myfile' as (t, u, v); @@ -306,7 +306,7 @@ D = foreach C generate group as uniqueke dump D; </source> -<p>Example Using DISTINCT</p> +<p>Example using DISTINCT:</p> <source> A = load 'myfile' as (t, u, v); @@ -315,8 +315,6 @@ C = distinct B; dump C; </source> -<p>In pig 0.1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In pig 0.2.0 it is not, and it is much faster and more efficient (depending on your key cardinality, up to 20x faster in pig team's tests). Therefore, the use of DISTINCT is recommended over GROUP BY - GENERATE. </p> - </section> </section> </body> Modified: hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml?rev=992468&r1=992467&r2=992468&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml (original) +++ hadoop/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml Fri Sep 3 21:06:37 2010 @@ -32,6 +32,8 @@ <p>Also, be sure to review the information in the <a href="cookbook.html">Pig Cookbook</a>. 
</p> </section> + <!-- ==================================================================== --> + <!-- PIG LATIN STATEMENTS --> <section> <title>Pig Latin Statements</title> @@ -62,8 +64,7 @@ <li>Either interactively or in batch </li> </ul> - - <p></p> + <p></p> <p>Note that Pig now uses Hadoop's local mode (rather than Pig's native local mode).</p> <p>A few run examples are shown here; see <a href="setup.html">Pig Setup</a> for more examples.</p> @@ -126,8 +127,6 @@ DUMP B; <p> </p> <p>See <a href="#Multi-Query+Execution">Multi-Query Execution</a> for more information on how Pig Latin statements are processed.</p> - - </section> <section> @@ -233,8 +232,20 @@ grunt> DUMP C; </section> <!-- END PIG LATIN STATEMENTS --> - +<!-- ================================================================== --> +<!-- MEMORY MANAGEMENT --> +<section> +<title>Memory Management</title> +<p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p> + +<p>The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 10% of available memory. Note that this memory is shared across all large bags used by the application.</p> + +</section> +<!-- END MEMORY MANAGEMENT --> + + +<!-- ==================================================================== --> <!-- MULTI-QUERY EXECUTION--> <section> <title>Multi-Query Execution</title> @@ -378,7 +389,6 @@ $ pig -F myscript.pig or $ pig -stop_on_failure myscript.pig </source> - </section> <section> @@ -497,25 +507,189 @@ Ftab = group .... Gtab = .... 
aggregation function STORE Gtab INTO '/user/vxj/finalresult2'; </source> +</section> +</section> + +</section> +<!-- END MULTI-QUERY EXECUTION--> +<!-- ==================================================================== --> + <!-- NULLS --> + <section> +<title>Null Values</title> +<p>Pig handles null values differently for the GROUP/COGROUP and JOIN operations.</p> + + <section> +<title>GROUP/COGROUP and Null Values</title> + +<p>When using the GROUP operator with a single relation, records with a null group key are grouped together.</p> +<source> +a = load 'student' as (name:chararray, age:int, gpa:float); +dump a; +(joe,18,2.5) +(sam,,3.0) +(bob,,3.5) + +x = group a by age; +dump x; +(18,{(joe,18,2.5)}) +(,{(sam,,3.0),(bob,,3.5)}) +</source> + +<p>When using the GROUP (COGROUP) operator with multiple relations, records with a null group key are considered different and are grouped separately. +In the example below note that there are two tuples in the output corresponding to the null group key: +one that contains tuples from relation A (but not relation B) and one that contains tuples from relation B (but not relation A).</p> + +<source> +A = load 'student' as (name:chararray, age:int, gpa:float); +B = load 'student' as (name:chararray, age:int, gpa:float); +dump B; +(joe,18,2.5) +(sam,,3.0) +(bob,,3.5) + +X = cogroup A by age, B by age; +dump X; +(18,{(joe,18,2.5)},{(joe,18,2.5)}) +(,{(sam,,3.0),(bob,,3.5)},{}) +(,{},{(sam,,3.0),(bob,,3.5)}) +</source> </section> + + <section> +<title>JOIN and Null Values</title> +<p>The JOIN operator - when performing inner joins - adheres to the SQL standard and disregards (filters out) null values.</p> + <source> +A = load 'student' as (name:chararray, age:int, gpa:float); +B = load 'student' as (name:chararray, age:int, gpa:float); +dump B; +(joe,18,2.5) +(sam,,3.0) +(bob,,3.5) + +X = join A by age, B by age; +dump X; +(joe,18,2.5,joe,18,2.5) +</source> </section> + </section> + <!-- END NULLS --> + +<!-- 
==================================================================== --> + + <!-- OPTIMIZATION RULES --> +<section> +<title>Optimization Rules</title> +<p>Pig supports various optimization rules. By default optimization, and all optimization rules, are turned on. +To turn off optimization, use:</p> + +<source> +pig -optimizer_off [opt_rule | all ] +</source> + +<p>Note that some rules are mandatory and cannot be turned off.</p> + +<section> +<title>ImplicitSplitInserter</title> +<p>Status: Mandatory</p> +<p> +<a href="piglatin_ref2.html#SPLIT">SPLIT</a> is the only operator that models multiple outputs in Pig. +To ease the process of building logical plans, all operators are allowed to have multiple outputs. As part of the +optimization, all non-split operators that have multiple outputs are altered to have a SPLIT operator as the output +and the outputs of the operator are then made outputs of the SPLIT operator. An example will illustrate the point. +Here, a split will be inserted after the LOAD and the split outputs will be connected to the FILTER (b) and the COGROUP (c). +</p> +<source> +A = LOAD 'input'; +B = FILTER A BY $1 == 1; +C = COGROUP A BY $0, B BY $0; +</source> </section> -<section> +<section> +<title>TypeCastInserter</title> +<p>Status: Mandatory</p> +<p> +If you specify a <a href="piglatin_ref2.html#Schemas">schema</a> with the +<a href="piglatin_ref2.html#LOAD">LOAD</a> statement, the optimizer will perform a prefix projection of the columns +and <a href="piglatin_ref2.html#Cast+Operators">cast</a> the columns to the appropriate types. An example will illustrate the point. +The LOAD statement (a) has a schema associated with it. The optimizer will insert a FOREACH operator that will project columns 0, 1 and 2 +and also cast them to chararray, int and float respectively. 
+</p> +<source> +A = LOAD 'input' AS (name: chararray, age: int, gpa: float); +B = FILTER A BY $1 == 1; +C = GROUP A By $0; +</source> +</section> + +<section> +<title>StreamOptimizer</title> +<p> +Optimize when <a href="piglatin_ref2.html#LOAD">LOAD</a> precedes <a href="piglatin_ref2.html#STREAM">STREAM</a> +and the loader class is the same as the serializer for the stream. Similarly, optimize when STREAM is followed by +<a href="piglatin_ref2.html#STORE">STORE</a> and the deserializer class is the same as the storage class. +For both of these cases the optimization is to replace the loader/serializer with BinaryStorage which just moves bytes +around and to replace the storer/deserializer with BinaryStorage. +</p> + +</section> +<section> +<title>OpLimitOptimizer</title> +<p> +The objective of this rule is to push the <a href="piglatin_ref2.html#LIMIT">LIMIT</a> operator up the data flow graph +(or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY. +</p> +<source> +A = LOAD 'input'; +B = ORDER A BY $0; +C = LIMIT B 10; +</source> +</section> + +<section> +<title>PushUpFilters</title> +<p> +The objective of this rule is to push the <a href="piglatin_ref2.html#FILTER">FILTER</a> operators up the data flow graph. +As a result, the number of records that flow through the pipeline is reduced. +</p> +<source> +A = LOAD 'input'; +B = GROUP A BY $0; +C = FILTER B BY $0 < 10; +</source> +</section> + +<section> +<title>PushDownExplodes</title> +<p> +The objective of this rule is to reduce the number of records that flow through the pipeline by moving +<a href="piglatin_ref2.html#FOREACH">FOREACH</a> operators with a +<a href="piglatin_ref2.html#Flatten+Operator">FLATTEN</a> down the data flow graph. +In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation. 
+</p> +<source> +A = LOAD 'input' AS (a, b, c); +B = LOAD 'input2' AS (x, y, z); +C = FOREACH A GENERATE FLATTEN($0), B, C; +D = JOIN C BY $1, B BY $1; +</source> +</section> +</section> <!-- END OPTIMIZATION RULES --> + + +<!-- ==================================================================== --> <!-- SPECIALIZED JOINS--> <section> <title>Specialized Joins</title> <p> -Pig Latin includes three "specialized" joins: replicated joins, skewed joins, and merge joins. </p> -<ul> -<li>Replicated, skewed, and merge joins can be performed using <a href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a>.</li> -<li>Replicated and skewed joins can also be performed using <a href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a>.</li> -</ul> +In certain cases, the performance of <a href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a> and <a href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a> +can be optimized using replicated, skewed, or merge joins. </p> + <!-- FRAGMENT REPLICATE JOINS--> <section> @@ -618,48 +792,61 @@ assumed to have been done (see the Condi Pig implements the merge join algorithm by selecting the left input of the join to be the input file for the map phase, and the right input of the join to be the side file. It then samples records from the right input to build an index that contains, for each sampled record, the key(s) the filename and the offset into the file the record - begins at. This sampling is done in an initial map only job. A second MapReduce job is then initiated, + begins at. This sampling is done in the first MapReduce job. A second MapReduce job is then initiated, with the left input as its input. Each map uses the index to seek to the appropriate record in the right input and begin doing the join. 
</p> <section> <title>Usage</title> -<p>Perform a merge join with the USING clause (see <a href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a>).</p> +<p>Perform a merge join with the USING clause (see <a href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a> and <a href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a>). </p> <source> -C = JOIN A BY a1, B BY b1 USING 'merge'; +D = JOIN A BY a1, B BY b1, C BY c1 USING 'merge'; </source> </section> <section> <title>Conditions</title> -<p> -Merge join will only work under these conditions: -</p> - +<p><strong>Condition A</strong></p> +<p>Inner merge join (between two tables) will only work under these conditions: </p> <ul> -<li>Both inputs are sorted in *ascending* order of join keys. If an input consists of many files, there should be -a total ordering across the files in the *ascending order of file name*. So for example if one of the inputs to the -join is a directory called input1 with files a and b under it, the data should be sorted in ascending order of join -key when read starting at a and ending in b. Likewise if an input directory has part files part-00000, part-00001, -part-00002 and part-00003, the data should be sorted if the files are read in the sequence part-00000, part-00001, -part-00002 and part-00003. </li> -<li>The merge join only has two inputs </li> -<li>The loadfunc for the right input of the join should implement the OrderedLoadFunc interface (PigStorage does -implement the OrderedLoadFunc interface). 
</li> -<li>Only inner join will be supported </li> - <li>Between the load of the sorted input and the merge join statement there can only be filter statements and foreach statement where the foreach statement should meet the following conditions: <ul> -<li>There should be no UDFs in the foreach statement </li> -<li>The foreach statement should not change the position of the join keys </li> -<li>There should not transformation on the join keys which will change the sort order </li> +<li>There should be no UDFs in the foreach statement. </li> +<li>The foreach statement should not change the position of the join keys. </li> +<li>There should be no transformation on the join keys which will change the sort order. </li> </ul> </li> +<li>Data must be sorted on join keys in ascending (ASC) order on both sides.</li> +<li>Right-side loader must implement either the {OrderedLoadFunc} interface or {IndexableLoadFunc} interface.</li> +<li>Type information must be provided for the join key in the schema.</li> +</ul> +<p></p> +<p>The Zebra and PigStorage loaders satisfy all of these conditions.</p> +<p></p> +<p><strong>Condition B</strong></p> +<p>Outer merge join (between two tables) and inner merge join (between three or more tables) will only work under these conditions: </p> +<ul> +<li>No other operations can be done between the load and join statements. </li> +<li>Data must be sorted on join keys in ascending (ASC) order on both sides. </li> +<li>Left-most loader must implement {CollectableLoadFunc} interface as well as {OrderedLoadFunc}. </li> +<li>All other loaders must implement {IndexableLoadFunc}. 
</li> +<li>Type information must be provided for the join key in the schema.</li> </ul> <p></p> +<p>The Zebra loader satisfies all of these conditions.</p> + +<p>An example of a left outer merge join using the Zebra loader:</p> +<source> +A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted'); +B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted'); +C = join A by id left, B by id using 'merge'; +</source> + +<p></p> +<p><strong>Both Conditions</strong></p> <p> For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be at least 128 MB). @@ -668,138 +855,22 @@ If the total input size (including all p job performing the merge-join will process. </p> -<p> -In local mode, merge join will revert to regular join. -</p> </section> </section><!-- END MERGE JOIN --> - </section> <!-- END SPECIALIZED JOINS--> - - <!-- OPTIMIZATION RULES --> -<section> -<title>Optimization Rules</title> -<p>Pig supports various optimization rules. By default optimization, and all optimization rules, are turned on. -To turn off optimiztion, use:</p> - -<source> -pig -optimizer_off [opt_rule | all ] -</source> -<p>Note that some rules are mandatory and cannot be turned off.</p> - -<section> -<title>ImplicitSplitInserter</title> -<p>Status: Mandatory</p> -<p> -<a href="piglatin_ref2.html#SPLIT">SPLIT</a> is the only operator that models multiple outputs in Pig. -To ease the process of building logical plans, all operators are allowed to have multiple outputs. As part of the -optimization, all non-split operators that have multiple outputs are altered to have a SPLIT operator as the output -and the outputs of the operator are then made outputs of the SPLIT operator. An example will illustrate the point. 
-Here, a split will be inserted after the LOAD and the split outputs will be connected to the FILTER (b) and the COGROUP (c). -</p> -<source> -A = LOAD 'input'; -B = FILTER A BY $1 == 1; -C = COGROUP A BY $0, B BY $0; -</source> -</section> - -<section> -<title>TypeCastInserter</title> -<p>Status: Mandatory</p> -<p> -If you specify a <a href="piglatin_ref2.html#Schemas">schema</a> with the -<a href="piglatin_ref2.html#LOAD">LOAD</a> statement, the optimizer will perform a pre-fix projection of the columns -and <a href="piglatin_ref2.html#Cast+Operators">cast</a> the columns to the appropriate types. An example will illustrate the point. -The LOAD statement (a) has a schema associated with it. The optimizer will insert a FOREACH operator that will project columns 0, 1 and 2 -and also cast them to chararray, int and float respectively. -</p> -<source> -A = LOAD 'input' AS (name: chararray, age: int, gpa: float); -B = FILER A BY $1 == 1; -C = GROUP A By $0; -</source> -</section> - -<section> -<title>StreamOptimizer</title> -<p> -Optimize when <a href="piglatin_ref2.html#LOAD">LOAD</a> precedes <a href="piglatin_ref2.html#STREAM">STREAM</a> -and the loader class is the same as the serializer for the stream. Similarly, optimize when STREAM is followed by -<a href="piglatin_ref2.html#STORE">STORE</a> and the deserializer class is same as the storage class. -For both of these cases the optimization is to replace the loader/serializer with BinaryStorage which just moves bytes -around and to replace the storer/deserializer with BinaryStorage. -</p> - -</section> - -<section> -<title>OpLimitOptimizer</title> -<p> -The objective of this rule is to push the <a href="piglatin_ref2.html#LIMIT">LIMIT</a> operator up the data flow graph -(or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY. 
-</p> -<source> -A = LOAD 'input'; -B = ORDER A BY $0; -C = LIMIT B 10; -</source> -</section> - -<section> -<title>PushUpFilters</title> -<p> -The objective of this rule is to push the <a href="piglatin_ref2.html#FILTER">FILTER</a> operators up the data flow graph. -As a result, the number of records that flow through the pipeline is reduced. -</p> -<source> -A = LOAD 'input'; -B = GROUP A BY $0; -C = FILTER B BY $0 < 10; -</source> -</section> - -<section> -<title>PushDownExplodes</title> -<p> -The objective of this rule is to reduce the number of records that flow through the pipeline by moving -<a href="piglatin_ref2.html#FOREACH">FOREACH</a> operators with a -<a href="piglatin_ref2.html#Flatten+Operator">FLATTEN</a> down the data flow graph. -In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation. -</p> -<source> -A = LOAD 'input' AS (a, b, c); -B = LOAD 'input2' AS (x, y, z); -C = FOREACH A GENERATE FLATTEN($0), B, C; -D = JOIN C BY $1, B BY $1; -</source> -</section> -</section> <!-- END OPTIMIZATION RULES --> - - <!-- MEMORY MANAGEMENT --> -<section> -<title>Memory Management</title> - -<p>Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p> - -<p>The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 10% of available memory. 
Note that this memory is shared across all large bags used by the application.</p> - -</section> -<!-- END MEMORY MANAGEMENT --> +<!-- ==================================================================== --> - <!-- ZEBRA INTEGRATION --> +<!-- ZEBRA INTEGRATION --> <section> <title>Zebra Integration</title> <p>For information about how to integrate Zebra with your Pig scripts, see <a href="zebra_pig.html">Zebra and Pig</a>.</p> </section> <!-- END ZEBRA INTEGRATION --> - - </body> </document>
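Editor's recap of the merge-join "Condition A" requirements documented in this commit, as a short Pig Latin sketch. The file name 'students_sorted' is illustrative (not from the commit); the schema reuses the name/age/gpa fields from the Null Values examples, and PigStorage is used because the text above states it satisfies Condition A.

```
-- Condition A (inner merge join between two tables):
--  * both inputs sorted on the join key in ascending order
--  * right-side loader implements OrderedLoadFunc or IndexableLoadFunc (PigStorage qualifies)
--  * type information provided for the join key (age:int)
A = load 'students_sorted' using PigStorage() as (name:chararray, age:int, gpa:float);
B = load 'students_sorted' using PigStorage() as (name:chararray, age:int, gpa:float);

-- Only the loads precede the join; no UDFs or join-key transformations in between.
C = join A by age, B by age using 'merge';
dump C;
```

Per the Conditions section, anything beyond filters and key-preserving foreach statements between the loads and the join would disqualify the merge join under Condition A.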