Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:

- [[Anchor(Introduction_to_Pig_Latin)]]
+ [[Anchor(Introduction)]]
  == Introduction to Pig Latin ==
@@ -12, +12 @@

  Every piece of data in Pig has one of these four types:
-    * A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be used as either a string or a number (see [[#FilterS][Filters]]). Examples of data atoms are 'apache.org' and '1.0'.
+    * A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be used as either a string or a number (see #Filter). Examples of data atoms are 'apache.org' and '1.0'.
    * A '''Tuple''' is a data record consisting of a sequence of "fields". Each field is a piece of data of any type (data atom, tuple or data bag). We denote tuples with < > bracketing. An example of a tuple is <'apache.org', '1.0'>.
    * A '''Data Bag''' is a set of tuples (duplicate tuples are allowed). You may think of it as a "table", except that Pig does not require that the tuple field types match, or even that the tuples have the same number of fields! (It is up to you whether you want these properties.) We denote bags by { } bracketing. Thus, a data bag could be {<'apache.org', '1.0'>, <'cnn.com', '0.8'>}.
    * A '''Data Map''' is a map from keys that are string literals to values that can be any data type. Think of it as a !HashMap<String,X> where X can be any of the four Pig data types. A Data Map supports the expected get and put interface. We denote maps by [ ] bracketing, with ":" separating the key and the value, and ";" separating successive key value pairs. Thus, a data map could be [ 'apache' : <'search', 'news'> ; 'cnn' : 'news' ]. Here, the key 'apache' is mapped to the tuple with 2 atomic fields 'search' and 'news', while the key 'cnn' is mapped to the data atom 'news'.
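 To see how these types nest, here is one illustrative value (an assumption, not from the page itself) combining all four: a data bag of tuples, each holding a data atom and a data map:
{{{
{ <'apache.org', ['rank' : '1.0']>, <'cnn.com', ['rank' : '0.8' ; 'topics' : <'news', 'tv'>]> }
}}}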
@@ -58, +58 @@

- [[Anchor(LOAD:_Loading_data_from_a_file)]]
+ [[Anchor(Load)]]
  ==== LOAD: Loading data from a file ====
  Before you can do any processing, you first need to load the data. This is 
done by the LOAD statement. Suppose we have a tab-delimited file called 
"myfile.txt" that contains a relation, whose contents are:
@@ -101, +101 @@

    * If you pass a directory name to LOAD, it will load all files within that directory.
    * You can use Hadoop-supported globbing to specify a file or list of files to load. See the Hadoop glob documentation for details on globbing syntax. Globs can be used at the file system or directory levels. (This functionality is available as of Pig 1.1e.)
- [[Anchor(FILTER:_Getting_rid_of_data_you_are_not_interested_in_)]]
+ [[Anchor(Filter)]]
  ==== FILTER: Getting rid of data you are not interested in  ====
  Very often, the first thing that you want to do with data is to get rid of 
tuples that you are not interested in. This can be done by the filter 
statement. For example,
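 a statement of the following shape keeps only the tuples whose first field equals '8' (the field name and constant here are assumptions):
{{{
Y = FILTER A BY f1 == '8';
}}}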
@@ -116, +116 @@

  <8, 4, 3>
- [[Anchor(Specifying_Conditions)]]
+ [[Anchor(Condition)]]
  ===== Specifying Conditions =====
 The condition following the keyword BY can be much more general than shown above:
     * The logical connectives AND, OR and NOT can be used to build a condition 
from various atomic conditions. 
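 For instance, the connectives above can combine atomic conditions like this (fields and constants assumed):
{{{
Y = FILTER A BY f1 == '8' OR NOT (f2 == '4' AND f3 == '3');
}}}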
@@ -135, +135 @@

    * If you want to get rid of specific columns or fields, rather than whole tuples, you should use the [[#ForeachS][FOREACH]] statement and not the FILTER statement.
    * If the built-in comparison operators are not sufficient for your needs, you can write your own '''filter function''' (see PigFunctions for details). Suppose you wrote a new equality function (say myEquals). Then the first example above can be written as `Y = FILTER A BY myEquals(f1,'8');`
- [[Anchor(COGROUP:_Getting_the_relevant_data_together)]]
+ [[Anchor(Cogroup)]]
  ==== COGROUP: Getting the relevant data together ====
  We can group the tuples in A according to some specification. A simple 
specification is to group according to the value of one of the fields, e.g. the 
first field. This is done as follows:
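 For instance (alias and field name assumed):
{{{
X = GROUP A BY f1;
}}}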
@@ -241, +241 @@

    * If the criterion on which the grouping has to be performed is more complicated than just the values of some fields, you can write your own Group Function, say myGroupFunc. Then we can write `GROUP A by myGroupFunc(*)`. Here "*" is a shorthand for all fields in the tuple. See PigFunctions for details.
     * A Group function can return multiple values for a tuple, i.e., a single 
tuple can belong to multiple groups. 
- [[Anchor(FOREACH_..._GENERATE:_Applying_transformations_to_the_data)]]
+ [[Anchor(Foreach)]]
  ==== FOREACH ... GENERATE: Applying transformations to the data ====
  The FOREACH statement is used to apply transformations to the data and to 
generate new [[#DataItems][data items]]. The basic syntax is
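 A minimal instance (alias and field names are assumptions):
{{{
Y = FOREACH A GENERATE f1, f2;
}}}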
@@ -419, +419 @@

 ''Note:'' On flattening, we might end up with fields that have the same name but which came from different tables. They are disambiguated by prepending `<alias>::` to their names. See PigLatinSchemas.
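 A sketch of how such names arise (aliases and field names assumed):
{{{
C = COGROUP A BY f1, B BY f1;
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
}}}
 Here the flattened fields of D are named `A::f1`, `A::f2`, ... and `B::f1`, `B::f2`, ... so that the two origins remain distinguishable.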
- [[Anchor(ORDER:_Sorting_data_according_to_some_fields)]]
+ [[Anchor(Order)]]
  ==== ORDER: Sorting data according to some fields ====
  We can sort the contents of any alias according to any set of columns. For 
@@ -444, +444 @@

    * However, the only guarantee is that if we retrieve the contents of X (see [[#RetrievingR][Retrieving Results]]), they will be in order of $2 (the third field).
     * To sort according to the combination of all columns, you can write 
`ORDER A by *` 
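 A sketch matching the notes above (alias assumed):
{{{
X = ORDER A BY $2;
}}}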
- [[Anchor(DISTINCT:_Eliminating_duplicates_in_data)]]
+ [[Anchor(Distinct)]]
  ==== DISTINCT: Eliminating duplicates in data ====
  We can eliminate duplicates in the contents of any alias. For example, 
suppose we first say
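 a pair of statements of this shape (alias and field name are assumptions):
{{{
X = FOREACH A GENERATE f1;
Y = DISTINCT X;
}}}
 Y then contains each distinct value of the first field exactly once.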
@@ -484, +484 @@

    * You can '''not''' request DISTINCT on a subset of the columns. To achieve this effect, use a [[#ProjectS][projection]] followed by the DISTINCT statement, as in the above example.
- [[Anchor(CROSS:_Computing_the_cross_product_of_multiple_relations)]]
+ [[Anchor(Cross)]]
  ==== CROSS: Computing the cross product of multiple relations ====
  To compute the cross product (also known as "cartesian product") of two or 
more relations, use:
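 For two relations, the statement takes this shape (aliases assumed):
{{{
X = CROSS A, B;
}}}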
@@ -511, +511 @@

    * This is an expensive operation and should not usually be necessary.
- [[Anchor(UNION:_Computing_the_union_of_multiple_relations)]]
+ [[Anchor(Union)]]
  ==== UNION: Computing the union of multiple relations ====
  We can vertically glue together contents of multiple aliases into a single 
alias by the UNION command. For example,
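 with two aliases (names assumed):
{{{
X = UNION A, B;
}}}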
@@ -545, +545 @@

        * be able to handle the different kinds of tuples while processing the 
result of the union.
     * UNION does not eliminate duplicate tuples.
- [[Anchor(SPLIT:_Separating_data_into_different_relations)]]
+ [[Anchor(Split)]]
  ==== SPLIT: Separating data into different relations ====
  The SPLIT statement, in some sense, is the converse of the UNION statement. 
It is used to partition the contents of a relation into multiple relations 
based on desired conditions. 
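 A sketch of the statement (alias, output names, and conditions are assumptions):
{{{
SPLIT A INTO X IF f1 < 7, Y IF f1 >= 7;
}}}
 Every tuple of A goes into each output relation whose condition it satisfies.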
@@ -605, +605 @@

    * Within the nested block, one can do nested filtering, projection, sorting, and duplicate elimination.
- [[Anchor(Increasing_the_parallelism)]]
+ [[Anchor(Increasing_parallelism)]]
  === Increasing the parallelism ===
 To increase the parallelism of a job, include the PARALLEL clause in any of your Pig Latin statements.
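 For instance (alias, field, and degree of parallelism assumed):
{{{
X = GROUP A BY f1 PARALLEL 10;
}}}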
@@ -634, +634 @@

    * In the current (1.2) and earlier releases, storage functions are case sensitive. This will be changed in future releases.
     * !PigStorage can only store flat tuples, i.e., tuples having atomic 
fields. If you want to store nested data, use !BinStorage instead.
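 A sketch of storing nested data this way (alias and output path assumed):
{{{
STORE X INTO 'myoutput' USING BinStorage();
}}}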
- [[Anchor(Experimenting_with_Pig_Latin_syntax)]]
+ [[Anchor(Experimenting)]]
  === Experimenting with Pig Latin syntax ===
  To experiment with the Pig Latin syntax, you can use the !StandAloneParser. 
Invoke it by the following command:
