Author: breed
Date: Fri Aug 21 18:34:57 2009
New Revision: 806668
URL: http://svn.apache.org/viewvc?rev=806668&view=rev
Log:
PIG-812. COUNT(*) does not work
Modified:
hadoop/pig/trunk/CHANGES.txt
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml
Modified: hadoop/pig/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=806668&r1=806667&r2=806668&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Fri Aug 21 18:34:57 2009
@@ -28,6 +28,8 @@
IMPROVEMENTS
+PIG-812: COUNT(*) does not work (breed)
+
PIG-923: Allow specifying log file location through pig.properties (dvryaboy
through daijy)
PIG-926: Merge-Join phase 2 (ashutoshc via pradeepkth)
Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml?rev=806668&r1=806667&r2=806668&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin.xml Fri
Aug 21 18:34:57 2009
@@ -1574,6 +1574,45 @@
</listitem>
</itemizedlist>
+ <section id="fexp">
+ <title>Field expressions</title>
+ <para>Field expressions represent a field or a dereference operator
applied to a field. See <xref linkend="deref" /> for more details.</para>
+ </section>
+
+ <section id="sexp">
+ <title>Star expression</title>
+ <para>The star symbol, *, can be used to represent all the fields of
a tuple. It is equivalent to writing out the fields explicitly. In the
following example the definitions of B and C are exactly the same, and MyUDF
will be invoked with exactly the same arguments in both cases.</para>
+ <programlisting>
+A = LOAD 'data' USING MyStorage() AS (name:chararray, age: int);
+B = FOREACH A GENERATE *, MyUDF(name, age);
+C = FOREACH A GENERATE name, age, MyUDF(*);
+ </programlisting>
+ <para>A common error when using the star expression is the
following:</para>
+ <programlisting>
+G = GROUP A BY $0;
+C = FOREACH G GENERATE COUNT(*);
+ </programlisting>
+ <para>In this example, the programmer really wants to count the
number of elements in the bag in the second field: COUNT($1).</para>
+ </section>
+
+ <section id="bexp">
+ <title>Boolean expressions</title>
+ <para>Boolean expressions can be made up of UDFs that return a
boolean value or boolean operators (see <xref linkend="boolops" />).
+ </para>
+ </section>
+
+ <section id="texp">
+ <title>Tuple expressions</title>
+ <para>Tuple expressions form subexpressions into tuples. The tuple
expression has the form (expression [, expression …]), where expression is a
general expression. The simplest tuple expression is the star expression, which
represents all fields.
+ </para>
+ </section>
+
+ <section id="gexp">
+ <title>General expressions</title>
+ <para>General expressions can be made up of UDFs and almost any
operator. Since Pig does not consider boolean a base type, the result of a
general expression cannot be a boolean. Field expressions are the simplest
general expressions.
+ </para>
+ </section>
+
</section>
<section>
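Reviewer note on the star-expression hunk above: a minimal Pig Latin sketch of the COUNT(*) pitfall and the intended form. It reuses the MyStorage loader and the name/age schema from the added documentation. Grouping by name and the aliases G and C are assumptions made only for illustration.

```pig
A = LOAD 'data' USING MyStorage() AS (name:chararray, age:int);
G = GROUP A BY name;
-- Each tuple of G has two fields: group (the key) and A (a bag of matching tuples).
-- Per the text, * stands for all fields, so COUNT(*) would not single out the bag.
-- Counting the bag explicitly is what the programmer usually intends:
C = FOREACH G GENERATE group, COUNT(A);   -- same as COUNT($1)
```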
@@ -4000,7 +4039,7 @@
<title>Types Table</title>
<para>The null operators can be applied to all data types. For more
information, see Nulls.</para></section></section>
- <section>
+ <section id="boolops">
<title>Boolean Operators</title>
<section>
@@ -4061,7 +4100,7 @@
</section></section></section>
- <section>
+ <section id="deref">
<title>Dereference Operators</title>
<section>
@@ -4083,10 +4122,10 @@
<para>tuple dereference    </para>
</entry>
<entry>
- <para>. (dot)</para>
+ <para>tuple.id or tuple.(id,…)</para>
</entry>
<entry>
- <para>Retrieve a field from a tuple. </para>
+ <para>Tuple dereferencing can be done by name
(tuple.field_name) or position (mytuple.$0). If a set of fields are
dereferenced (tuple.(name1, name2) or tuple.($0, $1)), the expression
represents a tuple composed of the specified fields. Note that if the dot
operator is applied to a bytearray, the bytearray will be assumed to be a
tuple.</para>
</entry>
</row>
<row>
@@ -4094,10 +4133,10 @@
<para>bag dereference</para>
</entry>
<entry>
- <para>. (dot)</para>
+ <para>bag.id or bag.(id,…)</para>
</entry>
<entry>
- <para>Retrieve a column from a bag. </para>
+ <para>Bag dereferencing can be done by name (bag.field_name) or
position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or
bag.($0, $1)), the expression represents a bag composed of the specified
fields.</para>
</entry>
</row>
<row>
@@ -4105,25 +4144,13 @@
<para>map dereference</para>
</entry>
<entry>
- <para># </para>
+ <para>map#'key'</para>
</entry>
<entry>
- <para>For a key#value pair, look up the value for the specified
key.</para>
+ <para>Map dereferencing must be done by key (field_name#key or
$0#key). If the pound operator is applied to a bytearray, the bytearray is
assumed to be a map. If the key does not exist, the empty string is
returned.</para>
</entry>
</row></tbody></tgroup>
</informaltable>
- <para>Note the following:</para>
- <itemizedlist>
- <listitem>
- <para>Tuple dereferencing can be done by name (tuple.field_name) or
position (mytuple.$0). Note that if the dot operator is applied to a bytearray,
the bytearray will be assumed to be a tuple.</para>
- </listitem>
- <listitem>
- <para>Bag dereferencing can be done by name (bag.field_name) or
position (bag.$0). </para>
- </listitem>
- <listitem>
- <para>Map dereferencing must be done by key (field_name#key or
$0#key). If the pound operator is applied to a bytearray, the bytearray is
assumed to be a map. If the key does not exist, the empty string is
returned.</para>
- </listitem>
- </itemizedlist>
<section>
<title>Example: Tuple</title>
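Reviewer note on the dereference table rewritten above: a short Pig Latin sketch of the three forms it describes. The schema, the field names (t, m, x, y), and the map key 'city' are hypothetical and chosen only to exercise each operator.

```pig
-- Hypothetical schema: t is a tuple field, m is a map field.
A = LOAD 'data' AS (name:chararray, t:tuple(x:int, y:int), m:map[]);

-- Tuple dereference: by name, by position, and as a set of fields
-- (the last form yields a tuple composed of the listed fields).
B = FOREACH A GENERATE t.x, t.$1, t.(x, y);

-- Map dereference: always by key.
C = FOREACH A GENERATE m#'city';

-- Bag dereference: after GROUP, A is a bag inside each tuple of G.
-- A.name is a bag holding only the name column; A.(name, t) keeps two columns.
G = GROUP A BY name;
D = FOREACH G GENERATE group, A.name, A.(name, t);
```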
@@ -4400,6 +4427,15 @@
</informaltable></section></section></section>
<section>
+ <title>Flatten Operator</title>
+ <para>The flatten operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and bags in a way
that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the
same, but the operation and result are different for each type of
structure.</para>
+
+ <para>For tuples, flatten substitutes the fields of a tuple in place of the
tuple. For example, consider a relation that has a tuple of the form (a, (b,
c)). The expression GENERATE $0, flatten($1), will cause that tuple to become
(a, b, c).</para>
+
+ <para>For bags, the situation becomes more complicated. When we un-nest a
bag, we create new tuples. If we have a relation that is made up of tuples of
the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two
tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes
we cause a cross product to happen. For example, consider a relation that has a
tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator.
If we apply the expression GENERATE $0, flatten($1) to this tuple, we will
create new tuples: (a, b, c) and (a, d, e).</para>
+ </section>
+
+ <section>
<title>Cast Operators</title>
<section>
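Reviewer note on the Flatten Operator section added above: a minimal Pig Latin sketch of the bag case, where un-nesting behaves like a cross product with the other generated fields. The owner/pet schema is borrowed from the COGROUP examples elsewhere in this diff; the aliases G and F are assumptions.

```pig
A = LOAD 'data1' AS (owner:chararray, pet:chararray);
G = GROUP A BY owner;
-- Each tuple of G has the form (group, {(owner, pet), ...}),
-- i.e. the (a, {(b,c), (d,e)}) shape described in the text.
F = FOREACH G GENERATE group, FLATTEN(A);
-- F has one tuple per element of the bag, each prefixed with the group key:
-- (a, {(b,c), (d,e)}) becomes (a, b, c) and (a, d, e).
```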
@@ -4765,7 +4801,7 @@
</entry>
</row></tbody></tgroup>
</informaltable>
-
+
<section>
<title>Syntax  </title>
<informaltable frame="all">
@@ -4970,148 +5006,8 @@
<section>
<title>COGROUP</title>
- <para>Groups the data in two or more relations.</para>
-
- <section>
- <title>Syntax</title>
- <informaltable frame="all">
- <tgroup cols="1"><tbody><row>
- <entry>
- <para>alias =COGROUP alias BY field_alias [INNER | OUTER] ,
alias  BY field_alias [INNER | OUTER] [PARALLEL n] ;</para>
- </entry>
- </row></tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Terms</title>
- <informaltable frame="all">
- <tgroup cols="2"><tbody><row>
- <entry>
- <para>alias</para>
- </entry>
- <entry>
- <para>The name a relation.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para>field_alias</para>
- </entry>
- <entry>
- <para>The name of one or more fields in a relation. </para>
- <para>If multiple fields are specified, separate with commas
and enclose in parentheses. For example, X = COGROUP A BY (f1, f2);</para>
- <para>The number of fields specified in each BY clause must
match. For example, X = COGROUP A BY (a1,a2,a3), B BY (b1,b2,b3);</para>
- </entry>
- </row>
- <row>
- <entry>
- <para>BY</para>
- </entry>
- <entry>
- <para>Keyword.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para>INNER</para>
- </entry>
- <entry>
- <para>Keyword. </para>
- </entry>
- </row>
- <row>
- <entry>
- <para>OUTER</para>
- </entry>
- <entry>
- <para>Keyword.</para>
- </entry>
- </row>
- <row>
- <entry>
- <para>PARALLEL n</para>
- </entry>
- <entry>
- <para>Increase the parallelism of a job by specifying the
number of reduce tasks, n. The optimal number of parallel tasks depends on the
amount of memory on each node and the memory required by each of the tasks. To
determine n, use the following as a general guideline:</para>
- <para/>
- <para>  n = (nr_nodes - 1) * 0.45 * nr_GB</para>
- <para/>
- <para>where nr_nodes is the number of nodes used and nr_GB is
the amount of  physical memory on each node.</para>
- <para>Note the following:</para>
- <itemizedlist>
- <listitem>
- <para>Parallel only affects the number of reduce tasks.
Map parallelism is determined by the input file, one map for each HDFS block.
</para>
- </listitem>
- <listitem>
- <para>If you don't specify parallel, you still get the
same map parallelism but only one reduce task. </para>
- </listitem>
- </itemizedlist>
- </entry>
- </row></tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Usage</title>
- <para>The COGOUP operator groups the data in two or more relations based on
the common field values. Note that the COGROUP and JOIN operators perform
similar functions. COGROUP creates a nested set of output tuples while JOIN
creates a flat set of output tuples.</para></section>
-
- <section>
- <title>Examples</title>
- <para>Suppose we have two relations, A and B.</para>
-<programlisting>
-A = LOAD 'data1' AS (owner:chararray,pet:chararray);
-
-DUMP A;
-(Alice,turtle)
-(Alice,goldfish)
-(Alice,cat)
-(Bob,dog)
-(Bob,cat)
-
-B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
-
-DUMP B;
-(Cindy,Alice)
-(Mark,Alice)
-(Paul,Bob)
-(Paul,Jane)
-</programlisting>
-
- <para>In this example tuples are co-grouped using field “owner” from
relation A and field “friend2” from relation B as the key fields. The
DESCRIBE operator shows the schema for relation X, which has two fields,
"group" and "A" (see the GROUP operator for information about the field
names).</para>
-<programlisting>
-X = COGROUP A BY owner, B BY friend2;
-
-DESCRIBE X;
-X: {group: chararray,A: {owner: chararray,pet: chararray},b: {firend1:
chararray,friend2: chararray}}
-</programlisting>
-
- <para>Relation X looks like this. A tuple is created for each unique key
field. The tuple includes the key field and two bags. The first bag is the
tuples from the first relation with the matching key field. The second bag is
the tuples from the second relation with the matching key field. If no tuples
match the key field, the bag is empty.</para>
-<programlisting>
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-(Jane,{},{(Paul,Jane)})
-</programlisting>
-
- <para>In this example tuples are co-grouped and the INNER keyword is used
to ensure that only bags with at least one tuple are returned. </para>
-<programlisting>
-X = COGROUP A BY owner INNER, B BY friend2 INNER;
-
-DUMP X;
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-</programlisting>
-
- <para>In this example tuples are co-grouped and the INNER keyword is used
asymmetrically on only one of the relations.</para>
-<programlisting>
-X = COGROUP A BY owner, B BY friend2 INNER;
-
-DUMP X;
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-(Jane,{},{(Paul,Jane)})
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-</programlisting>
-
- </section></section>
-
+ <para>COGROUP is the same as GROUP, but for readability purposes
programmers usually use GROUP when only one relation is involved and COGROUP
with multiple relations. See <xref linkend="GROUP" /> for more
information.</para>
+</section>
<section>
<title>CROSS</title>
<para>Computes the cross product of two or more relations.</para>
@@ -5367,7 +5263,7 @@
<para>expression</para>
</entry>
<entry>
- <para>An expression.</para>
+ <para>A boolean expression.</para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -5459,7 +5355,7 @@
<para/>
<para>alias = FOREACH nested_alias {</para>
<para>  alias = nested_op; [alias = nested_op; …]</para>
- <para>  GENERATE expression [expression ….]</para>
+ <para>  GENERATE expression [, expression …]</para>
<para>};</para>
<para/>
<para>Where:</para>
@@ -5759,16 +5655,16 @@
</section></section>
- <section>
+ <section id="GROUP">
<title>GROUP</title>
- <para>Groups the data in a single relation.</para>
+ <para>Groups the data in one or more relations. For readability,
COGROUP is usually used with multiple relations and GROUP is used with a single
relation, but they are the same operator.</para>
<section>
<title>Syntax</title>
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>alias = GROUP alias { [ALL] | [BY {[field_alias [,
field_alias]] | * | [expression]] } [PARALLEL n];</para>
+ <para>alias = GROUP alias { ALL | BY expression} [, alias ALL
| BY expression …] [PARALLEL n];</para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -5804,27 +5700,10 @@
</row>
<row>
<entry>
- <para>field_alias</para>
- </entry>
- <entry>
- <para>The name of a field in a relation. This is the group key
or key field. </para>
- <para>A relation can be grouped by a single field (f1) or by
the composite value of multiple fields (f1,f2).</para>
- </entry>
- </row>
- <row>
- <entry>
- <para>*</para>
- </entry>
- <entry>
- <para>The designator for a tuple.</para>
- </entry>
- </row>
- <row>
- <entry>
<para>expression</para>
</entry>
<entry>
- <para>An expression.</para>
+ <para>A tuple expression. This is the group key or key field.
If the result of the tuple expression is a single field, the key will be the
value of the first field rather than a tuple with one field.</para>
</entry>
</row>
@@ -5853,7 +5732,7 @@
<section>
<title>Usage</title>
- <para>The GROUP operator groups together tuples that have the same group
key (key field). The result of a GROUP operation is a relation that includes
one tuple per group. This tuple contains two fields: </para>
+ <para>The GROUP operator groups together tuples that have the same group
key (key field). The key field will be a tuple if the group key has more than
one field, otherwise it will be the same type as that of the group key. The
result of a GROUP operation is a relation that includes one tuple per group.
This tuple contains two fields: </para>
<itemizedlist>
<listitem>
<para>The first field is named "group" (do not confuse this with the
GROUP operator) and is the same type of the group key.</para>
@@ -5863,7 +5742,11 @@
<para/>
<para>The names of both fields are generated by the system as shown
in the example below.</para>
</listitem>
- </itemizedlist></section>
+ </itemizedlist>
+ <para>
+ Note that the GROUP (and thus COGROUP) and JOIN operators perform similar
functions. GROUP creates a nested set of output tuples while JOIN creates a
flat set of output tuples.
+ </para>
+ </section>
<section>
<title>Example</title>
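Reviewer note on the revised GROUP usage text above: a small Pig Latin sketch of how the type of the "group" field depends on the key expression. The f1/f2/f3 schema matches the example relation used later in this section; the aliases X, Y, and Z are assumptions.

```pig
A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);

-- Single-field key: "group" has the type of f2 (an int), not a one-field tuple.
X = GROUP A BY f2;

-- Multi-field key: "group" is a tuple made of f2 and f3.
Y = GROUP A BY (f2, f3);

-- Fields of a tuple key can be dereferenced with the dot operator.
Z = FOREACH Y GENERATE group.f2, group.f3, COUNT(A);
```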
@@ -5923,10 +5806,6 @@
(20,{(Bill)})
</programlisting>
- </section></section>
-
- <section>
- <title>Example</title>
<para>Suppose we have relation A.</para>
<programlisting>
A = LOAD 'data' as (f1:chararray, f2:int, f3:int);
@@ -5946,6 +5825,62 @@
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
</programlisting>
+
+ <para>Suppose we have two relations, A and B.</para>
+<programlisting>
+A = LOAD 'data1' AS (owner:chararray,pet:chararray);
+
+DUMP A;
+(Alice,turtle)
+(Alice,goldfish)
+(Alice,cat)
+(Bob,dog)
+(Bob,cat)
+
+B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
+
+DUMP B;
+(Cindy,Alice)
+(Mark,Alice)
+(Paul,Bob)
+(Paul,Jane)
+</programlisting>
+
+ <para>In this example tuples are co-grouped using field “owner” from
relation A and field “friend2” from relation B as the key fields. The
DESCRIBE operator shows the schema for relation X, which has three fields,
"group", "A", and "B" (see the GROUP operator for information about the field
names).</para>
+<programlisting>
+X = COGROUP A BY owner, B BY friend2;
+
+DESCRIBE X;
+X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1:
chararray,friend2: chararray}}
+</programlisting>
+
+ <para>Relation X looks like this. A tuple is created for each unique key
field. The tuple includes the key field and two bags. The first bag is the
tuples from the first relation with the matching key field. The second bag is
the tuples from the second relation with the matching key field. If no tuples
match the key field, the bag is empty.</para>
+<programlisting>
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+(Jane,{},{(Paul,Jane)})
+</programlisting>
+
+ <para>In this example tuples are co-grouped and the INNER keyword is used
to ensure that only bags with at least one tuple are returned. </para>
+<programlisting>
+X = COGROUP A BY owner INNER, B BY friend2 INNER;
+
+DUMP X;
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+</programlisting>
+
+ <para>In this example tuples are co-grouped and the INNER keyword is used
asymmetrically on only one of the relations.</para>
+<programlisting>
+X = COGROUP A BY owner, B BY friend2 INNER;
+
+DUMP X;
+(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
+(Jane,{},{(Paul,Jane)})
+(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
+</programlisting>
+
+ </section>
</section>
<section>
@@ -5957,7 +5892,7 @@
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>alias = JOIN alias BY field_alias, alias BY field_alias
[, alias BY field_alias …] [USING "replicated"] [PARALLEL n]; </para>
+ <para>alias = JOIN alias BY {expression|'('expression [,
expression …]')'} (, alias BY {expression|'('expression [, expression
…]')'} …) [USING "replicated"] [PARALLEL n]; </para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -5983,10 +5918,10 @@
</row>
<row>
<entry>
- <para>field_alias</para>
+ <para>expression</para>
</entry>
<entry>
- <para>The name of a field in a relation. For the BY clause,
field_alias must be in alias.</para>
+ <para>A field expression.</para>
<para>Example: X = JOIN A BY fieldA, B BY fieldB, C BY
fieldC;</para>
</entry>
</row>
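Reviewer note on the revised JOIN syntax above: a brief Pig Latin sketch of the single-expression and parenthesized multi-expression forms of the BY clause. The relations, file names, and field names are hypothetical.

```pig
A = LOAD 'data1' AS (a1:chararray, a2:int);
B = LOAD 'data2' AS (b1:chararray, b2:int);

-- Single key expression on each side.
X = JOIN A BY a1, B BY b1;

-- Composite key: a parenthesized list, with the same number of
-- expressions in each BY clause.
Y = JOIN A BY (a1, a2), B BY (b1, b2);
```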
@@ -9848,4 +9783,4 @@
</section>
</article>
-
\ No newline at end of file
+