Author: olga
Date: Fri Dec 4 00:51:50 2009
New Revision: 887019

URL: http://svn.apache.org/viewvc?rev=887019&view=rev
Log:
PIG-1084: Pig 0.6.0 Documentation improvements (chandec via olgan)

Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=887019&r1=887018&r2=887019&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/CHANGES.txt (original) +++ hadoop/pig/branches/branch-0.6/CHANGES.txt Fri Dec 4 00:51:50 2009 @@ -24,6 +24,8 @@ IMPROVEMENTS +PIG-1084: Pig 0.6.0 Documentation improvements (chandec via olgan) + PIG-978: MQ docs update (chandec via olgan) PIG-872: use distributed cache for the replicated data set in FR join Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=887019&r1=887018&r2=887019&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml Fri Dec 4 00:51:50 2009 @@ -64,7 +64,7 @@ <section> <title>Project Early and Often </title> -<p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p> +<p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. 
For example, say you have a query like: </p> <source> A = load 'myfile' as (t, u, v); @@ -74,7 +74,7 @@ E = foreach D generate group, COUNT($1); </source> -<p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the above query to </p> +<p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by pig. </p> <source> A = load 'myfile' as (t, u, v); @@ -87,8 +87,7 @@ E = foreach D generate group, COUNT($1); </source> -<p>will greatly reduce the amount of data being carried through the map and reduce phases by pig. Depending on your data, this can produce significant time savings. In queries similar to the example given we have seen total time drop by 50%. </p> - +<p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p> </section> <section> @@ -147,12 +146,9 @@ </section> <section> -<title>Make your UDFs Algebraic</title> +<title>Make Your UDFs Algebraic</title> -<p>Queries that can take advantage of the combiner generally ran much faster (sometimes several times faster) than the versions that don't. -The latest code significantly improves combiner usage; however, you need to make sure you do your part. -If you have a UDF that works on grouped data and is, by nature, algebraic (meaning their computation can be decomposed into multiple steps) -make sure you implement it as such. For details on how to write algebraic UDFs, see the <a href="udf.html">Pig UDF Manual</a>. </p> +<p>Queries that can take advantage of the combiner generally ran much faster (sometimes several times faster) than the versions that don't. 
The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning their computation can be decomposed into multiple steps) make sure you implement it as such. For details on how to write algebraic UDFs, see the Pig UDF Manual and <a href="udf.html#Aggregate+Functions">Aggregate Functions</a>.</p> <source> A = load 'data' as (x, y, z) @@ -162,12 +158,18 @@ </source> <p>If <code>MyUDF</code> is algebraic, the query will use combiner and run much faster. You can run <code>explain</code> command on your query to make sure that combiner is used. </p> +</section> +<section> +<title>Implement the Aggregator Interface</title> +<p> +If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Aggregator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used on conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>. +</p> </section> + <section> <title>Drop Nulls Before a Join</title> -<p>This comment only applies to pig 0.2.0 branch, as pig 0.1.0 does not have nulls. </p> <p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a standard join the rows with a null key will always be dropped. 
The join: </p> <source> @@ -199,10 +201,13 @@ </section> <section> -<title>Take Advantage of Join Optimization</title> +<title>Take Advantage of Join Optimizations</title> -<p>The optimization insures that the last table in the join is not brought into memory but stream through instead. The optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p> -<p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query. </p> +<section> +<title>Regular Join Optimizations</title> +<p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p> +<p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query. +In some of our tests we saw 10x performance improvement as the result of this optimization.</p> <source> small = load 'small_file' as (t, u, v); @@ -210,34 +215,36 @@ C = join small by t, large by x; </source> -<p>In some of our tests we saw 10x performance improvement as the result of this optimization. </p> </section> <section> -<title>Use Fragment Replicate Join</title> +<title>Specialized Join Optimizations</title> +<p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins. +For more information see <a href="piglatin_users.html#Specialized+Joins">Specialized Joins</a>.</p> +</section> -<p>This type of join works well if one of tables is small enough to fit into main memory. In this case, Pig can perform a very efficient join since, in hadoop world, it can be done completely on the map side. 
</p> +</section> -<source> -tiny = load 'small_file' as (t, u, v); -large = load 'large_file' as (x, y, z); -C = join large by x, tiny by t using "replicated"; -</source> -<p>Note that the large table must come first followed by one or more small tables. All small tables together must fit into main memory. </p> -<p>This feature is new and experimental. It is experimental because we don't have a strong sense of how small the small table must be to fit into memory. In our tests with a simple query that involved just join a table of up to 100M can be used if the process overall gets 1 GB of memory. If the table does not fit into memory, the process would fail and generate an error. </p> +<section> +<title>Use the PARALLEL Keyword</title> -</section> +<p>PARALLEL controls the number of reducers invoked by Hadoop. The default value is 1. However, the number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently.</p> -<section> -<title>Use PARALLEL Keyword</title> +<p>The keyword makes sense with any operator that starts a reduce phase. This includes +<a href="piglatin_reference.html#COGROUP">COGROUP</a>, +<a href="piglatin_reference.html#CROSS">CROSS</a>, +<a href="piglatin_reference.html#DISTINCT">DISTINCT</a>, +<a href="piglatin_reference.html#GROUP">GROUP</a>, +<a href="piglatin_reference.html#JOIN">JOIN</a>, +<a href="piglatin_reference.html#ORDER">ORDER</a>, and +<a href="piglatin_reference.html#JOIN%2C+OUTER">OUTER JOIN</a>. -<p>PARALLEL controls the number of reducers invoked by Hadoop. The default out of the box is 1 which, in most cases, is not what you want. 
I reasonable heuristic to use is something like </p> +</p> -<source><num machines> * <num reduce slots per machine> * 0.9</source> +<p>You can set the value of PARALLEL in your scripts in conjunction with the operator (see the example below). You can also set the value of PARALLEL for all scripts using the <a href="piglatin_reference.html#set">set</a> command.</p> -<p>The keyword makes sense on any operator that starts a reduce phase. This includes GROUP, COGROUP, JOIN, DISTINCT, LIMIT, ORDER BY. </p> -<p>Example: </p> +<p>Example</p> <source> A = load 'myfile' as (t, u, v); @@ -248,7 +255,7 @@ </section> <section> -<title>Use LIMIT</title> +<title>Use the LIMIT Operator</title> <p>A lot of the times, you are not interested in the entire output but either a sample or top results. In those cases, using LIMIT can yeild a much better performance as we push the limit as high as possible to minimize the amount of data travelling through the pipeline. </p> <p>Sample: Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=887019&r1=887018&r2=887019&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml Fri Dec 4 00:51:50 2009 @@ -5451,7 +5451,8 @@ <para>expression</para> </entry> <entry> - <para>A tuple expression. This is the group key or key field. If the result of the tuple expression is a single field, the key will be the value of the first field rather than a tuple with one field.</para> + <para>A tuple expression. This is the group key or key field. 
If the result of the tuple expression is a single field, the key will be the value of the first field rather than a tuple with one field. To group using multiple keys, enclose the keys in parentheses:</para> + <para>B = GROUP A BY (key1,key2);</para> </entry> </row> @@ -5657,6 +5658,17 @@ <section> <title>Example</title> +<para>This example shows to group using multiple keys.</para> +<programlisting> + A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int, date:chararray, result:chararray, tsid:int, tag:chararray); + B = GROUP A BY (tcid, tpid); +</programlisting> + </section> + + + + <section> + <title>Example</title> <para>This example shows a map-side group.</para> <programlisting> register zebra.jar; @@ -5772,7 +5784,9 @@ <section> <title>Usage</title> <para>Use the JOIN operator to perform an inner, equijoin join of two or more relations based on common field values. - The JOIN operator always performs an inner join. Note that the JOIN and COGROUP operators perform similar functions. + The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.</para> + + <para>Note that the JOIN and COGROUP operators perform similar functions. 
JOIN creates a flat set of output records while COGROUP creates a nested set of output records.</para> </section> @@ -9784,6 +9798,17 @@ <entry> <para>sets user-specified name for the job </para> </entry> + </row> + <row> + <entry> + <para>default_parallel</para> + </entry> + <entry> + <para>whole number </para> + </entry> + <entry> + <para>sets the number of reducers for all MapReduce jobs generated by Pig</para> + </entry> </row></tbody></tgroup> </informaltable> <para/> @@ -9795,6 +9820,7 @@ <programlisting> grunt> set debug on grunt> set job.name 'my job' +grunt> set default_parallel 100 </programlisting> </section></section> Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=887019&r1=887018&r2=887019&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml Fri Dec 4 00:51:50 2009 @@ -480,7 +480,7 @@ <p> Pig Latin includes three "specialized" joins: fragement replicate joins, skewed joins, and merge joins. Replicate, skewed, and merge joins can be performed using the <a href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins). -Replicate and skewed joins can be performed using the the the <a href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax. +Replicate and skewed joins can also be performed using the the the <a href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax. </p> <!-- FRAGMENT REPLICATE JOINS--> @@ -755,77 +755,6 @@ </section> <!-- END MEMORY MANAGEMENT --> - <!-- ZEBRA INTEGRATION --> -<section> -<title>Integration with Zebra</title> - <p>This version of Pig is integrated with Zebra storage format. 
Zebra is a recent contrib project of Pig and the details can be found at http://wiki.apache.org/pig/zebra. Pig can now: </p> - <ul> - <li>Load data in Zebra format</li> - <li>Take advantage of sorted Zebra tables in case of map-side group and merge join.</li> - <li>Store data in Zebra format</li> - </ul> - <p></p> - <p>To load data in Zebra format using TableLoader, do the following:</p> - <source> -register /zebra.jar; -A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader(); -B = FOREACH A GENERATE name, age, gpa; -</source> - - <p>There are a couple of things to note:</p> - <ol> - <li>You need to register a Aebra jar file the same way you would do it for any other UDF.</li> - <li>You need to place the jar on your classpath.</li> - <li>Zebra data is self-described and always contains schema. This means that the AS clause is unnecessary as long as - you know what the column names and types are. To determine the column names and types, you can run the DESCRIBE statement right after the load: - <source> -A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader(); -DESCRIBE A; -a: {name: chararray,age: int,gpa: float} -</source> - </li> - </ol> - -<p>You can provide alternative names to the columns with AS clause. You can also provide types as long as the - original type can be converted to the new type.</p> - -<p>You can provide multiple, comma-separated files to the loader:</p> -<source> -A = LOAD 'studenttab, votertab' USING org.apache.hadoop.zebra.pig.TableLoader(); -</source> - -<p>TableLoader supports efficient column selection. The current version of Pig does not support automatically pushing - projections down to the loader. (The work is in progress and will be done after beta.) - Meanwhile, the loader allows passing columns down via a list of arguments. 
This example tells the loader to only return two columns, name and age.</p>
-<source>
-A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age');
-</source>
-
-<p>If the input data is globally sorted, map-side group or merge join can be used. Please, notice the “sorted” argument passed to the loader. This lets the loader know that the data is expected to be globally sorted and that a single key must be given to the same map.</p>
-
-<p>Here is an example of the merge join. Note that the first argument to the loader is left empty to indicate that all columns are requested.</p>
-<source>
-A = LOAD'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
-B = LOAD 'votersortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
-G = JOIN A BY $0, B By $0 USING "merge";
-</source>
-
-<p>Here is an example of a map-side group. Note that multiple sorted files are passed to the loader and that the loader will perform sort preserving merge to make sure that the data is globally sorted.</p>
-<source>
-A = LOAD 'studentsortedtab, studentnullsortedtab' using org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted');
-B = GROUP A BY $0 USING "collected";
-C = FOREACH B GENERATE group, MAX(a.$1);
-</source>
-
-<p>You can also write data in Zebra format. 
Note that, since Zebra requires a schema to be stored with the data, the relation that is stored must have a name assigned (via alias) to every column in the relation.</p> -<source> -A = LOAD 'studentsortedtab, studentnullsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted'); -B = GROUP A BY $0 USING "collected"; -C = FOREACH B GENERATE group, MAX(a.$1) AS max_val; -STORE C INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer(''); -</source> - - </section> <!-- END ZEBRA INTEGRATION --> Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml?rev=887019&r1=887018&r2=887019&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/udf.xml Fri Dec 4 00:51:50 2009 @@ -342,7 +342,7 @@ </tr> </table> -<p>All Pig-specific classes are available <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/"> here</a> </p> +<p>All Pig-specific classes are available <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/"> here</a>. </p> <p><code>Tuple</code> and <code>DataBag</code> are different in that they are not concrete classes but rather interfaces. This enables users to extend Pig with their own versions of tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; they need to go through factory classes: <code>TupleFactory</code> and <code>BagFactory</code>. </p> <p>The builtin <code>TOKENIZE</code> function shows how bags and tuples are created. A function takes a text string as input and returns a bag of words from the text. (Note that currently Pig bags always contain tuples.) 
</p> @@ -866,7 +866,7 @@ </section> <section> -<title>Accumulate Interface</title> +<title>Accumulator Interface</title> <p>In Pig, problems with memory usage can occur when data, which results from a group or cogroup operation, needs to be placed in a bag and passed in its entirety to a UDF.</p> @@ -880,7 +880,7 @@ */ public void accumulate(Tuple b) throws IOException; /** - * Called when all tuples from current key have been passed to accumulate. + * Called when all tuples from current key have been passed to the accumulator. * @return the value for the UDF for this key. */ public T getValue();