[Pig Wiki] Update of "PigLatin" by OlgaN

Apache Wiki Mon, 05 Nov 2007 16:35:21 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigLatin

------------------------------------------------------------------------------
+ [[Anchor(Introduction_to_Pig_Latin)]]
- = Introduction to Pig Latin =
+ == Introduction to Pig Latin ==
  
  [[TableOfContents]]
  
  So you want to learn Pig Latin. Welcome! Let's begin with the data types.
  
+ [[Anchor(Data_Types)]]
- == Data Types ==
+ === Data Types ===
  
  Every piece of data in Pig has one of these four types:
  
-  * A '''Data Atom''' is a simple atomic data value. It is stored as a string 
but can be used as either a string or a number (see [#FilterS Filters]). 
Examples of data atoms are 'apache.org' and '1.0'.
+    * A '''Data Atom''' is a simple atomic data value. It is stored as a 
string but can be used as either a string or a number (see 
[[#FilterS][Filters]]). Examples of data atoms are 'apache.org' and '1.0'.
-  * A '''Tuple''' is a data record consisting of a sequence of "fields". Each 
field is a piece of data of any type (data atom, tuple or data bag). We denote 
tuples with < > bracketing. An example of a tuple is <apache.org, 1.0>.
+    * A '''Tuple''' is a data record consisting of a sequence of "fields". 
Each field is a piece of data of any type (data atom, tuple or data bag). We 
denote tuples with < > bracketing. An example of a tuple is <apache.org,1.0>.
-  * A '''Data Bag''' is a set of tuples (duplicate tuples are allowed). You 
may think of it as a "table", except that Pig does not require that the tuple 
field  types match, or even that the tuples have the same number of fields! (It 
is up to you whether you want these properties.) We denote bags by { } 
bracketing. Thus, a data bag could be {<apache.org,1.0>, <flickr.com,0.8>}
+    * A '''Data Bag''' is a collection of tuples (duplicate tuples are 
allowed). You may think of it as a "table", except that Pig does not require 
that the tuple field types match, or even that the tuples have the same number 
of fields! (It is up to you whether you want these properties.) We denote bags 
by { } bracketing. Thus, a data bag could be {<apache.org,1.0>, <flickr.com,0.8>}.
-  * A '''Data Map''' is a map from keys that are string literals to values 
that can be any data type. Think of it as a !HashMap<String,X> where X can be 
any of the 4 pig data types. A Data Map supports the expected get and put 
interface. We denote maps by [ ] bracketing, with ":" separating the key and 
the value, and ";" separating successive key value pairs. Thus. a data map 
could be [ 'apache' : <'pig', 'hadoop'> ; 'cnn' : 'news' ]. Here, the key 
'apache' is mapped to the tuple with 2 atomic fields 'pig' and 'hadoop', while 
the key 'cnn' is mapped to the data atom 'news'.
+    * A '''Data Map''' is a map from keys that are string literals to values 
that can be any data type. Think of it as a !HashMap<String,X> where X can be 
any of the four Pig data types. A Data Map supports the expected get and put 
interface. We denote maps by [ ] bracketing, with ":" separating the key and 
the value, and ";" separating successive key-value pairs. Thus, a data map 
could be [ 'apache' : <'search', 'news'> ; 'cnn' : 'news' ]. Here, the key 
'apache' is mapped to the tuple with 2 atomic fields 'search' and 'news', while 
the key 'cnn' is mapped to the data atom 'news'.
  
+ [[Anchor(Data_Items)]]
- == Data Items ==
+ === Data Items ===
  Data can be referred to in various powerful and convenient ways in Pig. Any 
data referred to is called a Data Item. We will illustrate all these ways by 
using the following example tuple.
  
+ {{{
- `t = < 1, {<2,3>,<4,6>,<5,7>}, ['apache':'hadoop']>`
+ t = < 1, {<2,3>,<4,6>,<5,7>}, ['apache':'search']>
- 
+ }}}
- Thus, ''t'' has 3 fields. Let these fields have names f1, f2, f3. Field f1 is 
an atom with value 1. Field f2 is a bag having 3 tuples. Field f3 is a data map 
having 1 key.
+ Thus, =t= has 3 fields. Let these fields have names f1, f2, f3. Field f1 is 
an atom with value 1. Field f2 is a bag having 3 tuples. Field f3 is a data map 
having 1 key.
  
  The following table lists the various methods of referring to data.
  
- || '''Method of Referring to Data''' || '''Example''' || '''Value for example 
tuple ''t'' ''' || '''Notes''' ||
+ || Method of Referring to Data || Example || Value for example tuple =t= || 
Notes ||
- || Constant || '1.0', or 'apache.org', or 'blah'|| Value constant 
irrespective of ''t'' || ||
+ || '''Constant''' || ''''1.0'''', or ''''apache.org'''', or ''''blah'''' || 
Value constant irrespective of =t= || ||
- || Field referred to by position || $0 || Data Atom '1' || In Pig, positions 
start at 0 and not 1 ||
+ || '''Field referred to by position''' || '''$0''' || Data Atom '1' || '''In 
Pig, positions start at 0 and not 1''' ||
- || Field referred to by name || f2 || Bag {<2,3>,<4,6>,<5,7>} || ||
+ || '''Field referred to by name''' || '''f2''' || Bag {<2,3>,<4,6>,<5,7>} || ||
- || Projection of another data item || f2.$0 || Bag {<2>,<4>,<5>} - the bag f2 
projected to the first field || ||
+ || '''Projection''' of another data item || '''f2.$0''' || Bag {<2>,<4>,<5>} 
- the bag f2 projected to the first field || ||
- || Map Lookup against another data item || f3#'apache' || Data Atom 'pig' || 
* User's responsibility to ensure that a lookup is written only against a  data 
map, otherwise a runtime error is thrown. [[BR]] * If the key being looked up 
does not exist, a Data Atom with an empty string is returned ||
+ || '''Map Lookup''' against another data item || '''f3#'apache'''' || Data 
Atom 'search' || * It is the user's responsibility to ensure that a lookup is 
written only against a data map; otherwise a runtime error is thrown <br>   * If 
the key being looked up does not exist, a Data Atom with an empty string is 
returned ||
- || Function applied to another data item || SUM(f2.$0) || 2+4+5 = 11 || SUM 
is a builtin Pig function. See PigFunctions for how to write your own functions 
||
+ || '''Function''' applied to another data item || '''SUM(f2.$0)''' || 2+4+5 = 
11 || SUM is a builtin Pig function. See PigFunctions for how to write your own 
functions ||
- || Infix Expression of other data items || COUNT(f2) + f1 / '2.0' || 3 + 1 / 
2.0 = 3.5 || ||
+ || '''Infix Expression''' of other data items || '''COUNT(f2) + f1 / '2.0'''' 
|| 3 + 1 / 2.0 = 3.5 ||  ||
- || Bincond, i.e., the value of the data item is chosen according to some 
condition ||(f1 = =  '1' ? '2' : COUNT(f2))|| '2' since f1=='1' is true. If f1 
were != '1', then the value of this data item for t would be COUNT(f2)=3 || See 
[#CondS Conditions] for what the format of the condition in the bincond can be 
||
+ || '''Bincond''', i.e., the value of the data item is chosen according to 
some condition || '''(f1 == '1' ? '2' : COUNT(f2))''' || '2' since f1=='1' is 
true. If f1 were != '1', then the value of this data item for t would be 
COUNT(f2)=3 || See [[#CondS][Conditions]] for the format that the condition in 
the bincond can take ||
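+ 
+ The rows of the table above can be checked by hand. The following is a 
+ minimal sketch in plain Python, not Pig code: it models the example tuple 
+ with strings, lists and dicts. The helper names project, lookup and total 
+ are illustrative assumptions and are not part of Pig.
+ 
```python
# Plain-Python stand-in for the example tuple t (not Pig's own data classes).
# Atoms are kept as strings, bags as lists of tuples, maps as dicts.
t = ("1", [("2", "3"), ("4", "6"), ("5", "7")], {"apache": "search"})
f1, f2, f3 = t


def project(bag, i):
    """Model of f2.$i: keep only field i of every tuple in the bag."""
    return [(row[i],) for row in bag]


def lookup(data_map, key):
    """Model of f3#'key': a missing key yields an empty-string atom."""
    return data_map.get(key, "")


def total(bag):
    """Model of SUM over a bag of one-field tuples, reading atoms as numbers."""
    return sum(float(row[0]) for row in bag)


by_position = t[0]                                # $0
projected = project(f2, 0)                        # f2.$0
looked_up = lookup(f3, "apache")                  # f3#'apache'
summed = total(projected)                         # SUM(f2.$0)
infix = len(f2) + float(f1) / 2.0                 # COUNT(f2) + f1 / '2.0'
bincond = "2" if float(f1) == 1 else len(f2)      # (f1 == '1' ? '2' : COUNT(f2))
```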
  
+ 
- == Pig Latin Statements ==
+ === Pig Latin Statements ===
  
  A Pig Latin statement is a command that produces a '''Relation'''. A relation 
is simply a data bag with a name. That name is called the relation's 
'''alias'''. The simplest Pig Latin statement is LOAD, which reads a relation 
from a file in the file system. Other Pig Latin statements process one or more 
input relations, and produce a new relation as a result.
  
- Pig commands can span multiple lines and must include ";" at the end.
+ Starting with the Pig 1.2 release (due on 09/30/07), Pig commands can span 
multiple lines and must end with ";".
  
  Examples:
  
+ {{{
- `grunt> A = load 'data' using PigStorage() as (x, y, z);`
+ grunt> A = load 'mydoc' using PigStorage()
+ as (a, b, c);
- `grunt>B = group A by x;`
+ grunt> B = group A by a;
- `grunt> C = foreach B {`[[BR]]
+ grunt> C = foreach B {
- `D = distinct A.y;` [[BR]]
+ D = distinct A.b;
- `generate flatten(group), COUNT(D);` [[BR]]
+ generate flatten(group), COUNT(D);
- `}`[[BR]]
+ }
- `grunt>` 
+ grunt> 
+ }}}
+  
+ [[Anchor(LOAD:_Loading_data_from_a_file)]]
+ ==== LOAD: Loading data from a file ====
  
+ Before you can do any processing, you first need to load the data. This is 
done by the LOAD statement. Suppose we have a tab-delimited file called 
"myfile.txt" that contains a relation, whose contents are:
  
+ {{{
+ 1    2    3
+ 4    2    1
+ 8    3    4
+ 4    3    3
+ 7    2    5
+ 8    4    3
+ }}}
+ 
+ Suppose we want to refer to the 3 fields as f1, f2, and f3. We can load this 
relation using the following command:
+ 
+ <blockquote><verbatim>
+ A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
+ </verbatim></blockquote>
+ 
+ <noautolink>
+ Here, PigStorage is the name of a "storage function" that takes care of 
parsing the file into a Pig relation. This storage function expects simple 
newline-separated records with delimiter-separated fields; it has one 
parameter, namely the field delimiter(s).  
+ </noautolink>
+ 
+ Future Pig Latin commands can refer to the alias "A" and will receive data 
that has been loaded from "myfile.txt". A will contain this data:
+ 
+ {{{
+ <1, 2, 3>
+ <4, 2, 1>
+ <8, 3, 4>
+ <4, 3, 3>
+ <7, 2, 5>
+ <8, 4, 3>
+ }}}
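+ 
+ The parsing step described above can be pictured with a minimal sketch in 
+ plain Python. It is not Pig's PigStorage code; the function name 
+ load_delimited and the in-memory string standing in for "myfile.txt" are 
+ illustrative assumptions.
+ 
```python
# In-memory stand-in for the tab-delimited file "myfile.txt".
myfile = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"


def load_delimited(text, delimiter="\t"):
    """Split newline-separated records into tuples of string atoms."""
    return [tuple(line.split(delimiter)) for line in text.splitlines() if line]


A = load_delimited(myfile)

# Model of the AS (f1,f2,f3) clause: names are just labels for positions.
schema = ("f1", "f2", "f3")
first_by_name = dict(zip(schema, A[0]))
```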
+ 
+ Notes:
+    * The storage function shown above is the default and can be omitted. 
+    * In the current (1.2) and earlier releases, storage functions are case 
sensitive. This will change in future releases.
+    * If you don't want to give names to fields, the AS clause can be omitted. 
You can refer to the fields by position, $0 for the first field and so on. 
+    * You can specify more complex schemas in the AS clause (see 
PigLatinSchemas).
+    * If your records are stored in some special format that our functions 
can't parse, you can of course write your own storage function (see 
PigFunctions).
+    * In Pig, relations are ''unordered'', which means we do not guarantee 
that tuples are processed in any particular order. (In fact, processing may be 
parallelized, in which case tuples are not processed according to ''any'' total 
ordering.)
+    * If you pass a directory name to LOAD, it will load all files within the 
directory.
+    * You can use Hadoop-supported globbing to specify a file or list of files 
to load.  See 
[[http://lucene.apache.org/hadoop/api/org/apache/hadoop/fs/FileSystem.html#globPaths(org.apache.hadoop.fs.Path)][the 
hadoop glob documentation]] for details on globbing syntax.  Globs can be 
used at the file system or directory levels.  (This functionality is available 
as of pig 1.1e.)
+   
+ [[Anchor(FILTER:_Getting_rid_of_data_you_are_not_interested_in_)]]
+ ==== FILTER: Getting rid of data you are not interested in  ====
+ Very often, the first thing that you want to do with data is to get rid of 
tuples that you are not interested in. This can be done by the filter 
statement. For example,
+ 
+ <blockquote><verbatim>
+ Y = FILTER A BY f1 == '8';
+ </verbatim></blockquote>
+ 
+ The result is Y =
+ 
+ {{{
+ <8, 3, 4>
+ <8, 4, 3>
+ }}}
+ 
+ [[Anchor(Specifying_Conditions)]]
+ ===== Specifying Conditions =====
+ The condition following the keyword BY can be much more general than shown 
above. 
+    * The logical connectives AND, OR and NOT can be used to build a condition 
from various atomic conditions. 
+    * Each atomic condition can be of the form `&lt;Data Item&gt; 
&lt;compOp&gt; &lt;Data Item&gt;` (see [[#DataItems][Data Items]] for what the 
format of data items can be). 
+    * The comparison operator compOp can be one of 
+       * '''==, <nop>!=, >, >=, <, or <=''' for '''numerical''' comparisons. 
'''Note that if these operators are used on non-numeric data, a runtime error 
will be thrown'''.
+       * '''eq, neq, gt, gte, lt, or lte''' for string comparisons
+       * '''matches''' for regular expression matching, e.g., $0 matches 
'.*apache.*'. The 
[[http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html][format of 
regular expressions]] is that supported by Java.
+ 
+ Thus, a somewhat more complicated condition can be
+ <blockquote><verbatim>
+ Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
+ </verbatim></blockquote>
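+ 
+ To see which tuples of A survive the more complicated condition, it can be 
+ traced in plain Python. This is a sketch of the semantics only, with atoms 
+ written as integers for brevity. The last line assumes that matches has to 
+ cover the whole string, as Java's String.matches does; Python's regular 
+ expression dialect also differs from Java's in some details.
+ 
```python
import re

# Relation A from the LOAD example, with atoms written as integers.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def keep(row):
    """Model of: (f1 == '8') OR (NOT (f2+f3 > f1))."""
    f1, f2, f3 = row
    return (f1 == 8) or (not (f2 + f3 > f1))


Y = [row for row in A if keep(row)]

# Assumption: a whole-string match, so the pattern needs leading/trailing ".*".
matches_apache = re.fullmatch(r".*apache.*", "www.apache.org") is not None
```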
+ 
+ Note:
+    * If you want to get rid of specific columns or fields, rather than whole 
tuples, you should use the [[#ForeachS][FOREACH]] statement and not the filter 
statement.
+    * If the builtin comparison operators are not sufficient for your needs, 
you can write your own '''filter function''' (see PigFunctions for details). 
Suppose you wrote a new equality function (say myEquals). Then the first 
example above can be written as `Y = FILTER A BY myEquals(f1,'8');`
+ 
+ [[Anchor(COGROUP:_Getting_the_relevant_data_together)]]
+ ==== COGROUP: Getting the relevant data together ====
+ 
+ We can group the tuples in A according to some specification. A simple 
specification is to group according to the value of one of the fields, e.g. the 
first field. This is done as follows:
+ 
+ <blockquote><verbatim>
+ X = GROUP A BY f1;
+ X = GROUP A BY (f1, f2 ..);
+ </verbatim></blockquote>
+ 
+ The result of the group statement consists of one tuple for each group. The 
first field of the tuple has name `group` and has the value on which the 
grouping has been performed, and the second field has name A and is a bag 
containing the tuples belonging to that group. Thus, X = :
+ 
+ {{{
+ <1, {<1, 2, 3>}>
+ <4, {<4, 2, 1>, <4, 3, 3>}>
+ <7, {<7, 2, 5>}>
+ <8, {<8, 3, 4>, <8, 4, 3>}>
+ }}}
+ 
+ Suppose we have a second relation B =
+ 
+ {{{
+ <2, 4>
+ <8, 9>
+ <1, 3>
+ <2, 7>
+ <2, 9>
+ <4, 6>
+ <4, 9>
+ }}}
+ 
+ We can ''co-group'' A and B, which means that we jointly group the tuples 
from A and B, using this command:
+ 
+ <blockquote><verbatim>
+ COGROUP A BY f1, B BY $0;
+ </verbatim></blockquote>
+ 
+ You can co-group by multiple columns the same way as for group.
+ 
+ The result is:
+ 
+ {{{
+ <1, {<1, 2, 3>}, {<1, 3>}>
+ <2, {}, {<2, 4>, <2, 7>, <2, 9>}>
+ <4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>,<4, 9>}>
+ <7, {<7, 2, 5>}, {}>
+ <8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>
+ }}}
+ 
+ Now, all of the original tuples whose first field is 1 are grouped together, 
the original tuples whose first field is 2 are together, and so on. Thus, 
similar to a group, the result of a co-group has one tuple for each group. The 
first field is called `group` as before and contains the value on which 
grouping has been performed. In addition, every tuple has a bag for each 
relation being co-grouped (having the same name as the alias for that relation) 
that contains the tuples of that relation belonging to that group. 
+ 
+ Note that some of the bags are empty, which indicates that no tuples from the 
corresponding input belong to that group. If we only wish to see groups for 
which <i>both</i> inputs have at least one tuple, we can write:
+ 
+ <blockquote><verbatim>
+ C = COGROUP A BY $0 INNER, B BY $0 INNER;
+ </verbatim></blockquote>
+ 
+ The result is C = 
+ 
+ {{{
+ <1, {<1, 2, 3>}, {<1, 3>}>
+ <4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
+ <8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>
+ }}}
+ 
+ The INNER keyword can be used asymmetrically, with the obvious meaning.
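+ 
+ The co-group results above, including the effect of INNER, can be reproduced 
+ with a minimal sketch in plain Python. This models the semantics only and says 
+ nothing about how Pig executes the statement. The function cogroup and its 
+ (relation, key function, inner flag) arguments are illustrative assumptions, 
+ and groups are sorted by key purely to make the output easy to compare.
+ 
```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def first_field(row):
    return row[0]


def cogroup(*inputs):
    """Each input is a (relation, key_function, inner) triple."""
    keys = sorted({key(row) for relation, key, _ in inputs for row in relation})
    result = []
    for k in keys:
        # One bag per input, holding that input's tuples for group k.
        bags = [[row for row in relation if key(row) == k]
                for relation, key, _ in inputs]
        required = [inner for _, _, inner in inputs]
        # An INNER input with an empty bag drops the whole group.
        if any(inner and not bag for inner, bag in zip(required, bags)):
            continue
        result.append((k, *bags))
    return result


outer = cogroup((A, first_field, False), (B, first_field, False))
C = cogroup((A, first_field, True), (B, first_field, True))
only_a_inner = cogroup((A, first_field, True), (B, first_field, False))
```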
+ 
+ In addition to using columns to group the data, an arbitrary expression can 
be used:
+ 
+ <blockquote><verbatim>
+ grunt> cat a      
+ r1    1       2
+ r2    2       1
+ r3    2       8
+ r4    4       4
+ grunt> a = load 'a';
+ grunt> b = group a by $1*$2;
+ grunt> dump b;
+ 
+ ------ MapReduce Job -----
+ Input: [/user/utkarsh/a:org.apache.pig.builtin.PigStorage()]
+ Map: [[*]]
+ Group: [GENERATE {[org.apache.pig.impl.builtin.MULTIPLY(GENERATE {[PROJECT
+ $1],[PROJECT $2]})],[*]}]
+ Combine: null
+ Reduce: null
+ Output: /tmp/temp1762405695/tmp1820603819:org.apache.pig.builtin.BinStorage
+ Split: null
+ Map parallelism: -1
+ Reduce parallelism: -1
+ Job jar size = 399671
+ Pig progress = 0%
+ Pig progress = 50%
+ Pig progress = 100%
+ (2.0, {(r1, 1, 2), (r2, 2, 1)})
+ (16.0, {(r3, 2, 8), (r4, 4, 4)})
+ grunt> 
+ </verbatim></blockquote>
+ 
+ Note: 
+    * If we want all tuples to go to a single group, e.g., when doing 
aggregates across entire relations, we can write `GROUP A ALL`.
+    * Similarly, if we don't care about how grouping is performed, we can 
write `GROUP A ANY`. In this case, the system will group tuples randomly into 
groups.
+    * A relation can be grouped (or co-grouped) according to the composite 
value of multiple fields. Thus, we can write `COGROUP A BY (f1,f2), B BY 
($0,$1)`.
+    * If the criterion on which the grouping has to be performed is more 
complicated than just the values of some fields, you can write your own Group 
Function, say myGroupFunc. Then we can write `GROUP A by myGroupFunc(*)`. Here 
"*" is a shorthand for all fields in the tuple. See PigFunctions for details.
+    * A Group function can return multiple values for a tuple, i.e., a single 
tuple can belong to multiple groups. 
+ 
+ [[Anchor(FOREACH_..._GENERATE:_Applying_transformations_to_the_data)]]
+ ==== FOREACH ... GENERATE: Applying transformations to the data ====
+ The FOREACH statement is used to apply transformations to the data and to 
generate new [[#DataItems][data items]]. The basic syntax is
+ 
+ `<output-alias> = FOREACH <input-alias> GENERATE <data-item 1>, <data-item 
2>, ... ;`
+ 
+ For each tuple in the input alias, the data items are evaluated, and a tuple 
containing these data items is put in the output alias. We explain this 
statement in greater detail by giving examples of typical uses.
+ 
+ [[Anchor(Projection)]]
+ ===== Projection =====
+ 
+ To select a subset of columns from a relation, use this command:
+ 
+ <blockquote><verbatim>
+ X = FOREACH A GENERATE f1, f2;
+ </verbatim></blockquote>
+ 
+ X contains tuples from A, but with only the first and second fields present 
in each tuple. For the value of A given above, X =
+ 
+ {{{
+ <1, 2>
+ <4, 2>
+ <8, 3>
+ <4, 3>
+ <7, 2>
+ <8, 4>
+ }}}
+ 
+ Projection elements can be given names using the `as <alias>` construct. This 
lets you refer to the fields of the produced expression by name in later 
statements:
+ 
+ {{{
+ X = FOREACH A GENERATE f1+f2 as sumf1f2;
+ Y = FILTER X by sumf1f2 > '5';
+ }}}
+ 
+ As with SQL, asterisk (*) is shorthand for all columns. For example, with:
+ 
+ <blockquote><verbatim>
+ X = FOREACH A GENERATE *;
+ </verbatim></blockquote>
+ 
+ X is identical to A.
+ 
+ [[Anchor(Nested_projection)]]
+ ===== Nested projection =====
+ 
+ If one of the fields in the input relation is a non-atomic field, we can 
perform projection on that field. For example, 
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE group, B.$1;
+ </verbatim></blockquote>
+ 
+ The result is:
+ 
+ {{{
+ <1, {<3>}>
+ <4, {<6>, <9>}>
+ <8, {<9>}>
+ }}}
+ 
+ Here is another example, in which multiple nested columns are retained:
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE group, A.(f1, f2);
+ </verbatim></blockquote>
+ 
+ The result is:
+ 
+ {{{
+ <1, {<1, 2>}>
+ <4, {<4, 2>, <4, 3>}>
+ <8, {<8, 3>, <8, 4>}>
+ }}}
+ 
+ [[Anchor(Applying_functions)]]
+ ===== Applying functions =====
+ 
+ Pig has a number of built-in functions. An example is the SUM() function, 
which takes the sum of a set of numbers in a bag. For example:
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE group, SUM(A.f1);
+ </verbatim></blockquote>
+ 
+ gives:
+ 
+ {{{
+ <1, 1>
+ <4, 8>
+ <8, 16>
+ }}}
+ 
+ You may also register your own function with Pig, and refer to it in Pig 
Latin commands. See PigFunctions.
+ 
+ [[Anchor(Flattening)]]
+ ===== Flattening =====
+ 
+ Sometimes we want to eliminate nesting. This can be accomplished via the 
FLATTEN keyword which can be attached before any valid data item. For example:
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE group, FLATTEN(A);
+ </verbatim></blockquote>
+ 
+ yields:
+ 
+ {{{
+ <1, 1, 2, 3>
+ <4, 4, 2, 1>
+ <4, 4, 3, 3>
+ <8, 8, 3, 4>
+ <8, 8, 4, 3>
+ }}}
+ 
+ As another example,
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE group, FLATTEN(A.f3);
+ </verbatim></blockquote>
+ 
+ yields:
+ 
+ {{{
+ <1, 3>
+ <4, 1>
+ <4, 3>
+ <8, 4>
+ <8, 3>
+ }}}
+ 
+ As a final example,
+ 
+ <blockquote><verbatim>
+ FOREACH C GENERATE flatten(A.(f1, f2)), flatten(B.$1);
+ </verbatim></blockquote>
+ 
+ yields:
+ 
+ {{{
+ <1, 2, 3>
+ <4, 2, 6>
+ <4, 3, 6>
+ <4, 2, 9>
+ <4, 3, 9>
+ <8, 3, 9>
+ <8, 4, 9>
+ }}}
+ 
+ Note that for the group '4' in C, there were 2 tuples each in the bags A and 
B. Thus, when both the bags are flattened, the cross product of these tuples is 
returned, i.e., the tuples  <4, 2, 6>, <4, 3, 6>, <4, 2, 9>, and <4, 3, 9> in 
the result.
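+ 
+ The cross-product behaviour of flattening two bags can be sketched in plain 
+ Python with itertools.product. This illustrates the semantics only; it starts 
+ from the inner co-group C shown earlier, and the variable names are 
+ illustrative. The order of the output tuples is not significant.
+ 
```python
from itertools import product

# The inner co-group C from the COGROUP section: (group, bag A, bag B).
C = [
    (1, [(1, 2, 3)], [(1, 3)]),
    (4, [(4, 2, 1), (4, 3, 3)], [(4, 6), (4, 9)]),
    (8, [(8, 3, 4), (8, 4, 3)], [(8, 9)]),
]

result = []
for group, bag_a, bag_b in C:
    left = [(f1, f2) for f1, f2, _ in bag_a]      # A.(f1, f2)
    right = [(value,) for _, value in bag_b]      # B.$1
    # Flattening both bags pairs every left tuple with every right tuple.
    for a_part, b_part in product(left, right):
        result.append(a_part + b_part)
```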
+ 
+ [[Anchor(Joining)]]
+ ===== Joining =====
+ 
+ The equi-join of A and B on column 0 can be expressed as follows:
+ 
+ <blockquote><verbatim>
+ JOIN A BY $0, B BY $0;
+ </verbatim></blockquote>
+ 
+ which is equivalent to:
+ 
+ <blockquote><verbatim>
+ X = COGROUP A BY $0 INNER, B BY $0 INNER;
+ FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
+ </verbatim></blockquote>
+ 
+ The result is:
+ 
+ {{{
+ <1, 2, 3, 1, 3>
+ <4, 2, 1, 4, 6>
+ <4, 3, 3, 4, 6>
+ <4, 2, 1, 4, 9>
+ <4, 3, 3, 4, 9>
+ <8, 3, 4, 8, 9>
+ <8, 4, 3, 8, 9>
+ }}}
+ 
+ <i>Note:</i> On flattening, we might end with fields that have the same name 
but which came from different tables. They are disambiguated by prepending 
`<alias>::` to their names. See PigLatinSchemas.
+ 
+ [[Anchor(ORDER:_Sorting_data_according_to_some_fields)]]
+ ==== ORDER: Sorting data according to some fields ====
+ We can sort the contents of any alias according to any set of columns. For 
example,
+ 
+ <blockquote>
+ {{{
+ X = ORDER A BY $2;
+ }}}
+ </blockquote>
+ 
+ One possible output (since ties are resolved arbitrarily) is X =
+ {{{
+ <4, 2, 1>
+ <1, 2, 3>
+ <4, 3, 3>
+ <8, 4, 3>
+ <8, 3, 4>
+ <7, 2, 5>
+ }}}
+ 
+ Notes:
+    * From the point of view of the Pig data model, A and X contain the same 
thing (since we mentioned earlier that relations are logically unordered). If 
you process X further, there is no guarantee that tuples will be processed in 
order.
+    * The only guarantee is that if we retrieve the contents of X 
(see [[#RetrievingR][Retrieving Results]]), they will be in order 
of $2 (the third field).
+    * To sort according to the combination of all columns, you can write 
`ORDER A by *` 
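+ 
+ As a rough illustration in plain Python (not Pig code), sorting A on its 
+ third field with a stable sort yields one of the valid outputs; Pig itself 
+ makes no promise about how ties are broken. Sorting whole tuples models 
+ ordering by the combination of all columns.
+ 
```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Model of ORDER A BY $2; Python's sort is stable, so ties keep input order.
X = sorted(A, key=lambda row: row[2])

# Model of ordering by all columns: compare the tuples field by field.
by_all_columns = sorted(A)
```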
+ 
+ [[Anchor(DISTINCT:_Eliminating_duplicates_in_data)]]
+ ==== DISTINCT: Eliminating duplicates in data ====
+ We can eliminate duplicates in the contents of any alias. For example, 
suppose we first say
+ 
+ {{{
+ X = FOREACH A GENERATE $2;
+ }}}
+ 
+ As we know, this will result in  X =
+ 
+ {{{
+ <3>
+ <1>
+ <4>
+ <3>
+ <5>
+ <3>
+ }}}
+ 
+ Now, if we say
+ 
+ <blockquote>
+ {{{
+ Y = DISTINCT X;
+ }}}
+ </blockquote>
+ 
+ The output is Y =
+ 
+ {{{
+ <1>
+ <3>
+ <4>
+ <5>
+ }}}
+ 
+ Notes:
+    * Note that original order is not preserved (another illustration of the 
fact that Pig relations are unordered). In fact, to eliminate duplicates, the 
input will be first sorted. 
+    * You '''cannot''' request DISTINCT on a subset of the columns directly. 
Instead, use [[#ProjectS][projection]] followed by the DISTINCT statement, as 
in the above example.
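+ 
+ The projection-then-distinct pattern can be sketched in plain Python as 
+ follows. It models the semantics only; the result is sorted here just to 
+ make it easy to read, since the relation itself is unordered.
+ 
```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

X = [(row[2],) for row in A]     # model of X = FOREACH A GENERATE $2
Y = sorted(set(X))               # model of Y = DISTINCT X
```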
+ 
+ 
+ [[Anchor(CROSS:_Computing_the_cross_product_of_multiple_relations)]]
+ ==== CROSS: Computing the cross product of multiple relations ====
+ 
+ To compute the cross product (also known as "cartesian product") of two or 
more relations, use:
+ 
+ <blockquote><verbatim>
+ X = CROSS A, B;
+ </verbatim></blockquote>
+ 
+ Based on the values of A and B given earlier in the document, the result is X 
=
+ 
+ {{{
+ <1, 2, 3, 2, 4>
+ <1, 2, 3, 8, 9>
+ <1, 2, 3, 1, 3>
+ <1, 2, 3, 2, 7>
+ <1, 2, 3, 2, 9>
+ <1, 2, 3, 4, 6>
+ <1, 2, 3, 4, 9>
+ <4, 2, 1, 2, 4>
+ <4, 2, 1, 8, 9>
+ ...
+ }}}
+ 
+ Notes:
+    * This is an expensive operation and should not usually be necessary.
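+ 
+ The cost is easy to see in a minimal Python sketch (a model of the 
+ semantics, not Pig code): every tuple of A is paired with every tuple of B, 
+ so the output size is the product of the input sizes.
+ 
```python
from itertools import product

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# Model of X = CROSS A, B: concatenate each pair of tuples.
X = [a + b for a, b in product(A, B)]
```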
+ 
+ [[Anchor(UNION:_Computing_the_union_of_multiple_relations)]]
+ ==== UNION: Computing the union of multiple relations ====
+ 
+ We can vertically glue together contents of multiple aliases into a single 
alias by the UNION command. For example,
+ 
+ <blockquote><verbatim>
+ X = UNION A, B;
+ </verbatim></blockquote>
+ 
+ The result is X =
+ 
+ {{{
+ <1, 2, 3>
+ <4, 2, 1>
+ <8, 3, 4>
+ <4, 3, 3>
+ <7, 2, 5>
+ <8, 4, 3>
+ <2, 4>
+ <8, 9>
+ <1, 3>
+ <2, 7>
+ <2, 9>
+ <4, 6>
+ <4, 9>
+ }}}
+ 
+ Notes:
+    * UNION is not order-preserving. The inputs are interpreted as unordered 
bags of tuples and the output union is also an unordered bag.
+    * UNION does not ensure (like in databases) that the tuples all adhere to 
the same schema, or even that they have the same number of fields, as in the 
above example. However, in the typical case, it should be so, and it is the 
user's responsibility to 
+       * either ensure the same kind of tuples in all aliases being unioned, 
or 
+       * be able to handle the different kinds of tuples while processing the 
result of the union.
+    * UNION does not eliminate duplicate tuples.
+ 
+ [[Anchor(SPLIT:_Separating_data_into_different_relations)]]
+ ==== SPLIT: Separating data into different relations ====
+ The SPLIT statement, in some sense, is the converse of the UNION statement. 
It is used to partition the contents of a relation into multiple relations 
based on desired conditions. 
+ 
+ 
+ An example of a SPLIT statement is the following,
+ 
+ <blockquote><verbatim>
+ SPLIT A INTO X IF $0 < 7, Y IF ($0 > 2 AND $0 != 7);
+ </verbatim></blockquote>
+ 
+ The output is 
+ 
+ {{{
+ X = 
+ <1, 2, 3>
+ <4, 2, 1>
+ <4, 3, 3>
+ 
+ and 
+ 
+ Y = 
+ <4, 2, 1>
+ <8, 3, 4>
+ <4, 3, 3>
+ <8, 4, 3>
+ }}}
+ 
+ Notes:
+    * This construct is useful if you want to logically output multiple things 
from your function. You can then attach a field to the output of your function, 
and later split on that field to get the multiple outputs.
+    * One tuple can go to multiple partitions, e.g., the <4, 2, 1> tuple above.
+    * A tuple might also go to none of the partitions, if it doesn't satisfy 
any of the conditions, e.g., the <7, 2, 5> tuple above.
+    * [[#CondS][Conditions]] can be specified as mentioned in the Filter 
statement.
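+ 
+ The behaviour described in the notes above (a tuple landing in several 
+ partitions, or in none) can be traced with a minimal sketch in plain Python. 
+ This models the semantics only; the dictionary of named conditions is an 
+ illustrative assumption.
+ 
```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# One condition per output relation; conditions may overlap or leave gaps.
conditions = {
    "X": lambda row: row[0] < 7,
    "Y": lambda row: row[0] > 2 and row[0] != 7,
}

parts = {name: [row for row in A if cond(row)]
         for name, cond in conditions.items()}

# Tuples that satisfy none of the conditions appear in no output relation.
unmatched = [row for row in A
             if not any(cond(row) for cond in conditions.values())]
```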
+ 
+ 
+ [[Anchor(Nested_Operations_in_FOREACH...GENERATE)]]
+ ==== Nested Operations in FOREACH...GENERATE ====
+ If one of the fields in the input relation is a data bag, the nested data bag 
can be treated as an '''inner''' or a '''nested relation'''. Consequently, in a 
FOREACH...GENERATE statement, we can perform many of the operations on this 
nested relation that we can on a regular relation. 
+ 
+ The specific operations that we can do on the nested relations are 
[[#FilterS][FILTER]], [[#OrderS][ORDER]], and [[#DistinctS][DISTINCT]]. Note 
that we do not allow FOREACH...GENERATE on the nested relation, since that 
leads to the possibility of arbitrary number of nesting levels. 
+ 
+ The syntax for doing the nested operations is very similar to the regular 
syntax and is demonstrated by the following example:
+ 
+ <blockquote><verbatim>
+ W = LOAD '...' AS (url, outlink);
+ G = GROUP W by url;
+ R = FOREACH G {
+       FW = FILTER W BY outlink eq 'www.apache.org';
+       PW = FW.outlink;
+       DW = DISTINCT PW;
+       GENERATE group, COUNT(DW);
+ }
+ </verbatim></blockquote>
+ 
+ Notes:
+    * Note the nested block within the FOREACH...GENERATE statement. The 
syntax is the same as regular Pig Latin syntax.
+    * The last statement in the nested block must be a GENERATE.
+    * Within the nested block, one can do nested filtering, projection, 
sorting, and duplicate elimination.
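+ 
+ What the nested block computes for each group can be sketched in plain 
+ Python. This models the semantics only. The rows in W are hypothetical, 
+ since the input file in the example is elided.
+ 
```python
from collections import defaultdict

# Hypothetical (url, outlink) rows; the real input is not given in the text.
W = [
    ("a.com", "www.apache.org"),
    ("a.com", "www.apache.org"),
    ("a.com", "www.cnn.com"),
    ("b.com", "www.cnn.com"),
]

# Model of G = GROUP W by url.
groups = defaultdict(list)
for url, outlink in W:
    groups[url].append((url, outlink))

R = []
for url, bag in sorted(groups.items()):
    FW = [row for row in bag if row[1] == "www.apache.org"]  # nested FILTER
    PW = [(row[1],) for row in FW]                           # nested projection
    DW = set(PW)                                             # nested DISTINCT
    R.append((url, len(DW)))                                 # GENERATE group, COUNT(DW)
```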
+ 
+ 
+ [[Anchor(Increasing_the_parallelism)]]
+ === Increasing the parallelism ===
+ 
+ To increase the parallelism of a job, include the PARALLEL clause in any of 
your Pig Latin statements.
+ 
+ For example, `J = JOIN A by url, B by url PARALLEL 50` 
+ 
+ A couple of notes:
+    * The PARALLEL keyword only affects the number of reduce tasks. Map 
parallelism is determined by the input file, one map for each HDFS block.
+    * The degree of parallelism depends on the size of your cluster.  At most 
2 map or reduce tasks can run on a machine simultaneously. So if you ask for 40 
machines, you might ask for 1000 reduces, but they will still run 80 at a 
time. The example above asks for 50 reduce tasks, which can all run at the same 
time if your cluster has at least 25 machines.
+    * When you don't specify PARALLEL, you still get the same map 
parallelism but only 1 reduce task.
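+ 
+ The arithmetic in the notes above can be worked through with a small Python 
+ sketch. The function name reduce_schedule is hypothetical, and the count of 
+ "waves" assumes all reduce tasks take about the same time, which is a 
+ simplification rather than a statement about the scheduler.
+ 
```python
import math


def reduce_schedule(machines, reduces, slots_per_machine=2):
    """Return (tasks running at once, number of waves needed)."""
    concurrent = min(reduces, machines * slots_per_machine)
    waves = math.ceil(reduces / concurrent)
    return concurrent, waves


forty_machines = reduce_schedule(40, 1000)   # 80 at a time
parallel_50 = reduce_schedule(25, 50)        # all 50 fit at once
```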
+ 
+ [[Anchor(Retrieving_Results)]]
+ === Retrieving Results ===
+ 
+ There are several convenient ways to retrieve the contents in a particular 
alias: 
+ 
+    * If you are issuing PigLatin through Grunt
+       1. The command `dump <alias>` will dump the contents of the alias on 
your screen. This is typically useful only as a sanity check to see if the 
correct results are being produced. 
+       1. The command `store &lt;alias&gt; into &lt;filename&gt; [ using 
&lt;store function spec&gt;]` will store the contents of the alias into the 
requested filename using the requested storage function (or the default 
function !PigStorage if the storage function is not specified).
+ 
+    * If you are issuing PigLatin through your java program
+       1. The call `PigServer.openIterator(String alias)` will give you an 
iterator over the contents of the alias. You may find it useful to know the 
PigDataTypeAPIs to process these contents.
+       1. The call `PigServer.store(String alias, String fileName, String 
storeFunc)` will (like the store command in grunt) store the contents of the 
alias into the requested filename using the requested storage function (or the 
default function !PigStorage if the storage function is not specified).
+ 
+ Notes:
+    * In the current (1.2) and earlier releases, storage functions are case 
sensitive. This will change in future releases.
+    * !PigStorage can only store flat tuples, i.e., tuples having atomic 
fields. If you want to store nested data, use !BinStorage instead.
+ 
+ [[Anchor(Experimenting_with_Pig_Latin_syntax)]]
+ === Experimenting with Pig Latin syntax ===
+ 
+ To experiment with the Pig Latin syntax, you can use the !StandAloneParser. 
Invoke it by the following command:
+ 
+ <blockquote>
+ {{{
+ java -cp pig.jar org.apache.pig.StandAloneParser
+ }}}
+ </blockquote>
+ 
+ 
+ Example usage:
+ 
+ {{{
+ $ java -cp pig.jar org.apache.pig.StandAloneParser
+ > A = LOAD 'myfile.txt';
+ ---- Query parsed successfully ---
+ > B = FOREACH A GENERATE $1, $2;
+ ---- Query parsed successfully ---
+ > C = COGROUP A BY $0, B BY $0;
+ ---- Query parsed successfully ---
+ Current aliases: A->null, 
+ > D = FOREACH C blah blah blah;
+ Parse error: org.apache.pig.impl.logicalLayer.parser.ParseException: 
Encountered "blah" at line 1, column 15.
+ Was expecting one of:
+     "generate" ...
+     "{" ...
+ > D = FOREACH C GENERATE 'hello world';
+ ---- Query parsed successfully ---
+ > quit
+ $ 
+ }}}
+ 
+ [[Anchor(Example_Pig_Latin_programs)]]
+ === Example Pig Latin programs ===
+ 
+ See PigLatinExamples
+ 
+ [[Anchor(Embedded_Pig_Latin)]]
+ === Embedded Pig Latin ===
+ 
+ Pig Latin can be embedded into a Java program in a manner similar to JDBC. 
See [[#EmbeddedP][the section on embedding PigLatin]].
+
