piglatin.xml

breed Fri, 21 Aug 2009 11:35:22 -0700

Author: breed
Date: Fri Aug 21 18:34:57 2009
New Revision: 806668

URL: http://svn.apache.org/viewvc?rev=806668&view=rev
Log:
PIG-812. COUNT(*) does not work


Modified:
    hadoop/pig/trunk/CHANGES.txt
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml

Modified: hadoop/pig/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=806668&r1=806667&r2=806668&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Fri Aug 21 18:34:57 2009
@@ -28,6 +28,8 @@
 
 IMPROVEMENTS
 
+PIG-812: COUNT(*) does not work (breed)
+
 PIG-923: Allow specifying log file location through pig.properties (dvryaboy 
through daijy)
 
 PIG-926: Merge-Join phase 2 (ashutoshc via pradeepkth)

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml?rev=806668&r1=806667&r2=806668&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml 
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml Fri 
Aug 21 18:34:57 2009
@@ -1574,6 +1574,45 @@
       </listitem>
    </itemizedlist>
 
+      <section id="fexp">
+          <title>Field expressions</title>
+          <para>Field expressions represent a field or a dereference operator 
applied to a field. See <xref linkend="deref" /> for more details.</para>
+      </section>
+
+      <section id="sexp">
+          <title>Star expression</title>
+          <para>The star symbol, *, can be used to represent all the fields of a tuple. It is equivalent to writing out the fields explicitly. In the following example the definitions of B and C are exactly the same, and MyUDF will be invoked with exactly the same arguments in both cases.</para>
+          <programlisting>
+A = LOAD 'data' USING MyStorage() AS (name:chararray, age: int);
+B = FOREACH A GENERATE *, MyUDF(name, age);
+C = FOREACH A GENERATE name, age, MyUDF(*);
+          </programlisting>
+          <para>A common error when using the star expression is the 
following:</para>
+          <programlisting>
+G = GROUP A BY $0;
+C = FOREACH G GENERATE COUNT(*);
+          </programlisting>
+          <para>In this example, the programmer really wants to count the 
number of elements in the bag in the second field: COUNT($1).</para>
+      </section>
+
+      <section id="bexp">
+          <title>Boolean expressions</title>
+          <para>Boolean expressions can be made up of UDFs that return a 
boolean value or boolean operators (see <xref linkend="boolops" />). 
+          </para>
+      </section>
+           
+      <section id="texp">
+          <title>Tuple expressions</title>
+          <para>Tuple expressions form subexpressions into tuples. The tuple expression has the form (expression [, expression …]), where expression is a general expression. The simplest tuple expression is the star expression, which represents all fields.
+          </para>
+      </section>
+
+      <section id="gexp">
+          <title>General expressions</title>
+          <para>General expressions can be made up of UDFs and almost any operator. Since Pig does not consider boolean a base type, the result of a general expression cannot be a boolean. Field expressions are the simplest general expressions.
+          </para>
+      </section>
+
    </section>
    
    <section>
@@ -4000,7 +4039,7 @@
    <title>Types Table</title>
    <para>The null operators can be applied to all data types. For more 
information, see Nulls.</para></section></section>
    
-   <section>
+   <section id="boolops">
    <title>Boolean Operators</title>
       
       <section>
@@ -4061,7 +4100,7 @@
    
    </section></section></section>
    
-   <section>
+   <section id="deref">
    <title>Dereference Operators</title>
    
    <section>
@@ -4083,10 +4122,10 @@
                <para>tuple dereference</para>
             </entry>
             <entry>
-               <para>. (dot)</para>
+               <para>tuple.id or tuple.(id,…)</para>
             </entry>
             <entry>
-               <para>Retrieve a field from a tuple. </para>
+               <para>Tuple dereferencing can be done by name (tuple.field_name) or position (mytuple.$0). If a set of fields is dereferenced (tuple.(name1, name2) or tuple.($0, $1)), the expression represents a tuple composed of the specified fields. Note that if the dot operator is applied to a bytearray, the bytearray will be assumed to be a tuple.</para>
             </entry>
          </row>
          <row>
@@ -4094,10 +4133,10 @@
                <para>bag dereference</para>
             </entry>
             <entry>
-               <para>. (dot)</para>
+               <para>bag.id or bag.(id,…)</para>
             </entry>
             <entry>
-               <para>Retrieve a column from a bag. </para>
+               <para>Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields is dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.</para>
             </entry>
          </row>
          <row>
@@ -4105,25 +4144,13 @@
                <para>map dereference</para>
             </entry>
             <entry>
-               <para># </para>
+               <para>map#'key'</para>
             </entry>
             <entry>
-               <para>For a key#value pair, look up the value for the specified 
key.</para>
+               <para>Map dereferencing must be done by key (field_name#key or 
$0#key). If the pound operator is applied to a bytearray, the bytearray is 
assumed to be a map. If the key does not exist, the empty string is 
returned.</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable>
-   <para>Note the following:</para>
-   <itemizedlist>
-      <listitem>
-         <para>Tuple dereferencing can be done by name (tuple.field_name) or 
position (mytuple.$0). Note that if the dot operator is applied to a bytearray, 
the bytearray will be assumed to be a tuple.</para>
-      </listitem>
-      <listitem>
-         <para>Bag dereferencing can be done by name (bag.field_name) or 
position (bag.$0). </para>
-      </listitem>
-      <listitem>
-         <para>Map dereferencing must be done by key (field_name#key or 
$0#key). If the pound operator is applied to a bytearray, the bytearray is 
assumed to be a map. If the key does not exist, the empty string is 
returned.</para>
-      </listitem>
-   </itemizedlist>
    
    <section>
    <title>Example: Tuple</title>
@@ -4400,6 +4427,15 @@
    </informaltable></section></section></section>
    
    <section>
+   <title>Flatten Operator</title>
+   <para>The flatten operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result are different for each type of structure.</para>
+
+   <para>For tuples, flatten substitutes the fields of a tuple in place of the 
tuple. For example, consider a relation that has a tuple of the form (a, (b, 
c)). The expression GENERATE $0, flatten($1), will cause that tuple to become 
(a, b, c).</para>
+
+   <para>For bags, the situation becomes more complicated. When we un-nest a 
bag, we create new tuples. If we have a relation that is made up of tuples of 
the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two 
tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes 
we cause a cross product to happen. For example, consider a relation that has a 
tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. 
If we apply the expression GENERATE $0, flatten($1) to this tuple, we will 
create new tuples: (a, b, c) and (a, d, e).</para>
+   </section>
+
+   <section>
    <title>Cast Operators</title>
    
    <section>
@@ -4765,7 +4801,7 @@
             </entry>
          </row></tbody></tgroup>
    </informaltable>
-   
+
    <section>
    <title>Syntax</title>
    <informaltable frame="all">
@@ -4970,148 +5006,8 @@
 
 <section>
 <title>COGROUP</title>
-   <para>Groups the data in two or more relations.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>alias =COGROUP alias BY field_alias [INNER | OUTER] , alias BY field_alias [INNER | OUTER] [PARALLEL n] ;</para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-      <tgroup cols="2"><tbody><row>
-            <entry>
-               <para>alias</para>
-            </entry>
-            <entry>
-               <para>The name a relation.</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>field_alias</para>
-            </entry>
-            <entry>
-               <para>The name of one or more fields in a relation. </para>
-               <para>If multiple fields are specified, separate with commas 
and enclose in parentheses. For example, X = COGROUP A BY (f1, f2);</para>
-               <para>The number of fields specified in each BY clause must 
match. For example, X = COGROUP A BY (a1,a2,a3), B BY (b1,b2,b3);</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>BY</para>
-            </entry>
-            <entry>
-               <para>Keyword.</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>INNER</para>
-            </entry>
-            <entry>
-               <para>Keyword. </para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>OUTER</para>
-            </entry>
-            <entry>
-               <para>Keyword.</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>PARALLEL n</para>
-            </entry>
-            <entry>
-               <para>Increase the parallelism of a job by specifying the 
number of reduce tasks, n. The optimal number of parallel tasks depends on the 
amount of memory on each node and the memory required by each of the tasks. To 
determine n, use the following as a general guideline:</para>
-               <para/>
-               <para>   n = (nr_nodes - 1) * 0.45 * nr_GB</para>
-               <para/>
-               <para>where nr_nodes is the number of nodes used and nr_GB is the amount of physical memory on each node.</para>
-               <para>Note the following:</para>
-               <itemizedlist>
-                  <listitem>
-                     <para>Parallel only affects the number of reduce tasks. 
Map parallelism is determined by the input file, one map for each HDFS block. 
</para>
-                  </listitem>
-                  <listitem>
-                     <para>If you don't specify parallel, you still get the same map parallelism but only one reduce task. </para>
-                  </listitem>
-               </itemizedlist>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>The COGOUP operator groups the data in two or more relations based on 
the common field values. Note that the COGROUP and JOIN operators perform 
similar functions. COGROUP creates a nested set of output tuples while JOIN 
creates a flat set of output tuples.</para></section>
-   
-   <section>
-   <title>Examples</title>
-   <para>Suppose we have two relations, A and B.</para>
-<programlisting>
-A = LOAD 'data1' AS (owner:chararray,pet:chararray);
-
-DUMP A;
-(Alice,turtle)
-(Alice,goldfish)
-(Alice,cat)
-(Bob,dog)
-(Bob,cat)
-
-B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
-
-DUMP B;
-(Cindy,Alice)
-(Mark,Alice)
-(Paul,Bob)
-(Paul,Jane)
-</programlisting>
-   
-   <para>In this example tuples are co-grouped using field “owner” from relation A and field “friend2” from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has two fields, "group" and "A" (see the GROUP operator for information about the field names).</para>
-<programlisting>
-X = COGROUP A BY owner, B BY friend2;
-
-DESCRIBE X;
-X: {group: chararray,A: {owner: chararray,pet: chararray},b: {firend1: 
chararray,friend2: chararray}}
-</programlisting>
-   
-   <para>Relation X looks like this. A tuple is created for each unique key 
field. The tuple includes the key field and two bags. The first bag is the 
tuples from the first relation with the matching key field. The second bag is 
the tuples from the second relation with the matching key field. If no tuples 
match the key field, the bag is empty.</para>
-<programlisting>
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-(Jane,{},{(Paul,Jane)})
-</programlisting>
-   
-   <para>In this example tuples are co-grouped and the INNER keyword is used 
to ensure that only bags with at least one tuple are returned. </para>
-<programlisting>
-X = COGROUP A BY owner INNER, B BY friend2 INNER;
-
-DUMP X;
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-</programlisting>
-   
-   <para>In this example tuples are co-grouped and the INNER keyword is used 
asymmetrically on only one of the relations.</para>
-<programlisting>
-X = COGROUP A BY owner, B BY friend2 INNER;
-
-DUMP X;
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-(Jane,{},{(Paul,Jane)})
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-</programlisting>
-   
-   </section></section>
-   
+   <para>COGROUP is the same operator as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved. See <xref linkend="GROUP" /> for more information.</para>
+</section>
    <section>
    <title>CROSS</title>
    <para>Computes the cross product of two or more relations.</para>
@@ -5367,7 +5263,7 @@
                <para>expression</para>
             </entry>
             <entry>
-               <para>An expression.</para>
+               <para>A boolean expression.</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5459,7 +5355,7 @@
                <para/>
                <para>alias = FOREACH nested_alias {</para>
                <para>   alias = nested_op; [alias = nested_op; …]</para>
-               <para>   GENERATE expression [expression ….]</para>
+               <para>   GENERATE expression [, expression …]</para>
                <para>};</para>
                <para/>
                <para>Where:</para>
@@ -5759,16 +5655,16 @@
    
    </section></section>
    
-   <section>
+   <section id="GROUP">
    <title>GROUP</title>
-   <para>Groups the data in a single relation.</para>
+   <para>Groups the data in one or more relations. For readability, GROUP is usually used with a single relation and COGROUP with multiple relations, but they are the same operator.</para>
    
    <section>
    <title>Syntax</title>
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = GROUP alias { [ALL] | [BY {[field_alias [, field_alias]] | * | [expression]] } [PARALLEL n];</para>
+               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5804,27 +5700,10 @@
          </row>
          <row>
             <entry>
-               <para>field_alias</para>
-            </entry>
-            <entry>
-               <para>The name of a field in a relation. This is the group key 
or key field. </para>
-               <para>A relation can be grouped by a single field (f1) or by 
the composite value of multiple fields (f1,f2).</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
-               <para>*</para>
-            </entry>
-            <entry>
-               <para>The designator for a tuple.</para>
-            </entry>
-         </row>
-         <row>
-            <entry>
                <para>expression</para>
             </entry>
             <entry>
-               <para>An expression.</para>
+               <para>A tuple expression. This is the group key or key field. 
If the result of the tuple expression is a single field, the key will be the 
value of the first field rather than a tuple with one field.</para>
             </entry>
          </row>
 
@@ -5853,7 +5732,7 @@
    
    <section>
    <title>Usage</title>
-   <para>The GROUP operator groups together tuples that have the same group 
key (key field). The result of a GROUP operation is a relation that includes 
one tuple per group. This tuple contains two fields: </para>
+   <para>The GROUP operator groups together tuples that have the same group 
key (key field). The key field will be a tuple if the group key has more than 
one field, otherwise it will be the same type as that of the group key. The 
result of a GROUP operation is a relation that includes one tuple per group. 
This tuple contains two fields: </para>
    <itemizedlist>
       <listitem>
          <para>The first field is named "group" (do not confuse this with the GROUP operator) and has the same type as the group key.</para>
@@ -5863,7 +5742,11 @@
          <para/>
          <para>The names of both fields are generated by the system as shown 
in the example below.</para>
       </listitem>
-   </itemizedlist></section>
+   </itemizedlist>
+   <para>
+   Note that the GROUP (and thus COGROUP) and JOIN operators perform similar 
functions. GROUP creates a nested set of output tuples while JOIN creates a 
flat set of output tuples.
+   </para>
+   </section>
    
    <section>
    <title>Example</title>
@@ -5923,10 +5806,6 @@
 (20,{(Bill)})
 </programlisting>
    
-   </section></section>
-   
-   <section>
-   <title>Example</title>
    <para>Suppose we have relation A.</para>
 <programlisting>
 A = LOAD 'data' as (f1:chararray, f2:int, f3:int);
@@ -5946,6 +5825,62 @@
 (2,{(r1,1,2),(r2,2,1)})
 (16,{(r3,2,8),(r4,4,4)})
 </programlisting>
+
+   <para>Suppose we have two relations, A and B.</para>
+<programlisting>
+A = LOAD 'data1' AS (owner:chararray,pet:chararray);
+
+DUMP A;
+(Alice,turtle)
+(Alice,goldfish)
+(Alice,cat)
+(Bob,dog)
+(Bob,cat)
+
+B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
+
+DUMP B;
+(Cindy,Alice)
+(Mark,Alice)
+(Paul,Bob)
+(Paul,Jane)
+</programlisting>
+   
+   <para>In this example tuples are co-grouped using field “owner” from relation A and field “friend2” from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has three fields, "group", "A", and "B" (see the GROUP operator for information about the field names).</para>
+<programlisting>
+X = COGROUP A BY owner, B BY friend2;
+
+DESCRIBE X;
+X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
+</programlisting>
+   
+   <para>Relation X looks like this. A tuple is created for each unique key 
field. The tuple includes the key field and two bags. The first bag is the 
tuples from the first relation with the matching key field. The second bag is 
the tuples from the second relation with the matching key field. If no tuples 
match the key field, the bag is empty.</para>
+<programlisting>
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+(Jane,{},{(Paul,Jane)})
+</programlisting>
+   
+   <para>In this example tuples are co-grouped and the INNER keyword is used 
to ensure that only bags with at least one tuple are returned. </para>
+<programlisting>
+X = COGROUP A BY owner INNER, B BY friend2 INNER;
+
+DUMP X;
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+</programlisting>
+   
+   <para>In this example tuples are co-grouped and the INNER keyword is used 
asymmetrically on only one of the relations.</para>
+<programlisting>
+X = COGROUP A BY owner, B BY friend2 INNER;
+
+DUMP X;
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+(Jane,{},{(Paul,Jane)})
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+</programlisting>
+   
+   </section>
    </section>
    
    <section>
@@ -5957,7 +5892,7 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = JOIN alias BY field_alias, alias BY field_alias [, alias BY field_alias …] [USING "replicated"] [PARALLEL n];</para>
+               <para>alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING "replicated"] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5983,10 +5918,10 @@
          </row>
          <row>
             <entry>
-               <para>field_alias</para>
+               <para>expression</para>
             </entry>
             <entry>
-               <para>The name of a field in a relation. For the BY clause, 
field_alias must be in alias.</para>
+               <para>A field expression.</para>
                <para>Example: X = JOIN A BY fieldA, B BY fieldB, C BY 
fieldC;</para>
             </entry>
          </row>
@@ -9848,4 +9783,4 @@
    </section>   
    </article>
   
-   
\ No newline at end of file
+
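
Illustrative note, not part of the commit: the Flatten Operator text in the diff above describes un-nesting of tuples and bags, including the cross product that can result. The following is a minimal Python sketch of that described behavior, assuming Pig tuples are modeled as Python tuples and bags as lists of tuples. The helper name flatten_generate is hypothetical and is not a Pig API.

```python
from itertools import product


def flatten_generate(row, flatten_positions):
    """Model GENERATE over one input tuple, flattening the given positions.

    Assumption for this sketch: tuples are Python tuples and bags are
    Python lists of tuples. This is not how Pig is implemented.
    """
    choices = []
    for i, field in enumerate(row):
        if i in flatten_positions and isinstance(field, list):
            # Flattening a bag: each inner tuple leads to its own output row.
            choices.append([tuple(t) for t in field])
        elif i in flatten_positions and isinstance(field, tuple):
            # Flattening a tuple: its fields take the place of the tuple.
            choices.append([field])
        else:
            # Not flattened: the field is carried through unchanged.
            choices.append([(field,)])
    # Combine one choice per field; with a bag this is the cross product.
    return [sum(combo, ()) for combo in product(*choices)]


# (a, (b, c)) with GENERATE $0, flatten($1)
print(flatten_generate(("a", ("b", "c")), {1}))
# (a, {(b,c), (d,e)}) with GENERATE $0, flatten($1)
print(flatten_generate(("a", [("b", "c"), ("d", "e")]), {1}))
```

The three cases from the documentation text (tuple, bag alone, bag next to a scalar) all go through the same cross-product step; the tuple case is simply a product over a single choice.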

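Illustrative note, not part of the commit: the COGROUP examples moved into the GROUP section above can be modeled with a small Python sketch. It assumes relations are lists of tuples, keys are given by field position, and null keys are ignored. The function name cogroup and its parameters are hypothetical, and output order here is simply insertion order, whereas the order of DUMP output in the documentation varies.

```python
def cogroup(a, a_key, b, b_key, a_inner=False, b_inner=False):
    """Model X = COGROUP A BY key [INNER], B BY key [INNER] for two relations."""
    groups = {}
    for t in a:
        # One entry per key value: (bag of tuples from A, bag of tuples from B).
        groups.setdefault(t[a_key], ([], []))[0].append(t)
    for t in b:
        groups.setdefault(t[b_key], ([], []))[1].append(t)
    result = []
    for key, (bag_a, bag_b) in groups.items():
        # INNER on a relation drops groups whose bag for that relation is empty.
        if a_inner and not bag_a:
            continue
        if b_inner and not bag_b:
            continue
        result.append((key, bag_a, bag_b))
    return result


A = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
     ("Bob", "dog"), ("Bob", "cat")]
B = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

# X = COGROUP A BY owner, B BY friend2;
for row in cogroup(A, 0, B, 1):
    print(row)
```

Passing a_inner=True and b_inner=True corresponds to the example with INNER on both relations, which drops the Jane group because its bag from A is empty.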
svn commit: r806668 - in /hadoop/pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/piglatin.xml
