Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
   1. Define an environment variable with the location of the Pig JAR file. For 
example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
- == Pig Scripts - Local Mode ==
+ == Pig Script Installation and Run - Local Mode ==
  To install and run the Pig scripts in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
+  1. Review the contents of the Pig tutorial file.
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig):
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
  }}}
  
-  1.#3 Review the results:
+  1.#4 Review the results:
  {{{
  $ ls -l /tmp/ngrams.txt
  }}}
  
  
- == Pig Scripts - Hadoop Cluster ==
+ == Pig Script Installation and Run - Hadoop Cluster ==
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory. 
+  1. Review the contents of the Pig tutorial file.
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
  }}}
-  1.#3 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
+  1.#4 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
   1. Execute the following command (using either tutorial.pig or 
tutorial-join.pig):
  {{{
  $ java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main tutorial.pig
  }}}
-  1.#5 Review the results:
+  1.#6 Review the results:
  {{{
  $ hadoop dfs -ls ngrams.txt
  }}}
@@ -65, +67 @@

  The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
  || tutorial.jar|| User-defined functions (UDFs) ||
- || tutorial.pig || Tutorial-1 (run on Hadoop) ||
+ || tutorial.pig || Tutorial pig script (Hadoop) ||
- || tutorial-local.pig ||Tutorail-1 (run in local mode) ||
+ || tutorial-local.pig || Tutorial pig script (local mode) ||
- || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
+ || tutorial-join.pig || Tutorial-join pig script (Hadoop) ||
- || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
+ || tutorial-join-local.pig || Tutorial-join pig script (local mode) ||
- || excite.log || Data file (for runs on Hadoop) ||
+ || excite.log || Data file (Hadoop) ||
- || excite-small.log || Data file (for runs in local mode) ||
+ || excite-small.log || Data file (local mode) ||
  || pornwords || Data file (porn keywords) ||
  
  The user-defined functions (UDFs) are described here.
  
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
- || N!GramGenerator || Extracts n-grams from the set of words. ||
+ || N!GramGenerator || Composes n-grams from the set of words. ||
  || !NonPornDetector|| Removes the record if the query field includes porn 
terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. 
||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
@@ -85, +87 @@

  || !TutorialUtil || Divides the query string into a set of words.||
  
  
- == Pig Tutorial-1 ==
+ == Tutorial Pig Script ==
  
- The Pig Tutorial-1 script (tutorial.pig or tutorial-local.pig) does the 
following:
+ The tutorial pig script (tutorial.pig or tutorial-local.pig) does the 
following:
  
-  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Registers the tutorial JAR file so that the user-defined functions (UDFs) 
can be called in the script. 
-  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Uses the !PigStorage function to load the excite log file (excite.log or 
excite-small.log) into the “raw” bag as an array of records with the fields 
'''user''', '''time''', and '''query'''. 
-  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is an URL.
+  * Calls the NonURLDetector UDF to remove records if the query field is empty 
or a URL. 
-  * Calls the !ToLower UDF to lower-case the query.
-  * Calls the !NonPornDetector UDF to remove porno query terms (defined by a 
list in a text file).
+  * Calls the !ToLower UDF to change the query field to lowercase.
+  * Calls the !NonPornDetector UDF to remove records if the query field 
contains porn terms. 
-  * Calls the !ExtractHour UDF to extract the hour from the time of the record.
+  * Calls the !ExtractHour UDF to extract the hour from the time field.
-  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
+  * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
-  * Calls the DISTINCT operator to get the unique n-grams for all records.
+  * Uses the DISTINCT command to get the unique n-grams for all records. 
+  * Uses the GROUP command to group records by n-gram and hour.
-  * Gets the count (occurrences) of each n-gram.
+  * Uses the COUNT function to get the count (occurrences) of each n-gram.
+  * Uses the GROUP command to group records by n-gram only.
   * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
+  * Uses the GENERATE command to assign names to the fields.
-  * Removes all records with a score less than or equal to 2.0.
+  * Uses the FILTER command to remove all records with a score less than or 
equal to 2.0.
-  * Sorts the remaining records by hour and score.
+  * Uses the ORDER command to sort the remaining records by hour and score.
-  * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
+  * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', '''ngram''', 
'''score''', '''count''', '''mean''' 
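
The steps above can be read as a small data-flow pipeline. The Pig script itself is not reproduced in this notification, so the following is only a Python sketch of the same flow, not the tutorial's code. Several details are assumptions made for illustration: the function names (`ngrams`, `hourly_counts`, `score_ngrams`) are hypothetical, the time field is assumed to look like `YYMMDDHHMMSS`, n-grams are assumed to be contiguous runs of one or two words, and the "popularity" score is assumed to be the distance of an hourly count from the n-gram's mean count, measured in standard deviations. The porn-term filter is omitted.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def ngrams(words, max_n=2):
    """Return every contiguous run of 1..max_n words, joined by spaces."""
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def hourly_counts(records, max_n=2):
    """Count distinct (user, hour, n-gram) triples per (n-gram, hour).

    records: iterable of (user, time, query) tuples.
    Assumption: time is a string shaped like YYMMDDHHMMSS.
    """
    seen = set()
    for user, time, query in records:
        query = query.strip().lower()          # ToLower step
        # NonURLDetector step (assumed rule): drop empty queries and URLs.
        if not query or query.startswith(("http://", "https://", "www.")):
            continue
        hour = time[6:8]                       # ExtractHour step (assumed offset)
        for gram in ngrams(query.split(), max_n):
            seen.add((user, hour, gram))       # the set plays the role of DISTINCT
    # GROUP by (n-gram, hour) and COUNT the members of each group.
    return Counter((gram, hour) for _user, hour, gram in seen)


def score_ngrams(counts, threshold=2.0):
    """Group by n-gram, score each hour, filter, and sort by hour and score.

    Assumption: score = (count - mean) / population standard deviation.
    """
    by_ngram = defaultdict(dict)
    for (gram, hour), count in counts.items():
        by_ngram[gram][hour] = count
    rows = []
    for gram, per_hour in by_ngram.items():
        values = list(per_hour.values())
        avg, spread = mean(values), pstdev(values)
        if spread == 0:
            continue                           # flat counts: no hour stands out
        for hour, count in per_hour.items():
            score = (count - avg) / spread
            if score > threshold:              # drops scores <= threshold
                rows.append((hour, gram, score, count, avg))
    return sorted(rows, key=lambda row: (row[0], row[2]))
```

Each returned row follows the field order listed above: hour, ngram, score, count, mean.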
  
- == Pig Tutorial-2 ==
  
- The Pig Tutorial-2 script (tutorial-join.pig or tutorial-join-local.pig) does 
the following:
+ == Tutorial-Join Pig Script ==
  
+ The tutorial-join pig script (tutorial-join.pig or tutorial-join-local.pig) 
does the following:
-  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
-  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
-  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is an URL.
-  * Calls the !ToLower UDF to lower-case the query.
-  * Calls the !NonPornDetector UDF to remove porno query terms (defined by a 
list in a text file).
-  * Calls the !ExtractHour UDF to extract the hour from the time of the record.
-  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
-  * Calls the DISTINCT operator to get the unique n-grams for all records.
-  * Calls the GROUP operator to group the records by n-gram and hour.
-  * Calls the FOREACH operator to count the size of each group.
-  * Calls the FILTER operator to get the n-grams for hour ‘00’ 
-  * Calls the FILTER operator to get the n-grams for hour ‘12’
-  * Calls the JOIN operator to get and count the n-grams in both “00” and 
“12”
-  * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''count00''', '''count12'''
  
+  * Registers the tutorial JAR file so that the user-defined functions (UDFs) 
can be called in the script. 
+  * Uses the !PigStorage function to load the excite log file (excite.log or 
excite-small.log) into the “raw” bag as an array of records with the fields 
'''user''', '''time''', and '''query'''. 
+  * Calls the NonURLDetector UDF to remove records if the query field is empty 
or a URL. 
+  * Calls the !ToLower UDF to change the query field to lowercase.
+  * Calls the !NonPornDetector UDF to remove records if the query field 
contains porn terms. 
+  * Calls the !ExtractHour UDF to extract the hour from the time field.
+  * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
+  * Uses the DISTINCT command to get the unique n-grams for all records. 
+  * Uses the GROUP command to group the records by n-gram and hour. 
+  * Uses the COUNT function to get the count (occurrences) of each n-gram. 
+  * Uses the GENERATE command to assign names to the fields.
+  * Uses the FILTER command to get the n-grams for hour ‘00’ 
+  * Uses the FILTER command to get the n-grams for hour ‘12’ 
+  * Uses the JOIN command to join the n-grams in hour “00” and hour 
“12” by field $0.
+  * Uses the COUNT function to get the count (occurrences) of the n-grams in 
both “00” and “12” 
+  * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', 
'''count00''', '''count12'''
+ 
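
The filter-and-join steps at the end of the list above can be sketched in Python as follows. This is an illustration of the idea only, not the tutorial's Pig code: `join_hours` is a hypothetical helper, and its input is assumed to be a mapping from (n-gram, hour) pairs to counts, as produced by the earlier group-and-count steps. The join key is taken to be the n-gram, on the assumption that this is what field $0 holds at that point in the script.

```python
def join_hours(counts, first="00", second="12"):
    """Inner-join the per-hour n-gram counts of two hours on the n-gram.

    counts: mapping of (ngram, hour) -> count (assumed input shape).
    Returns sorted (ngram, count_first, count_second) tuples for n-grams
    that occur in both hours.
    """
    # The two FILTER steps: keep only the rows for each hour of interest.
    left = {gram: n for (gram, hour), n in counts.items() if hour == first}
    right = {gram: n for (gram, hour), n in counts.items() if hour == second}
    # The JOIN step: only n-grams present on both sides survive.
    shared = left.keys() & right.keys()
    return sorted((gram, left[gram], right[gram]) for gram in shared)
```

An n-gram that appears in only one of the two hours is dropped, which is the behavior of an inner join.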
