Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
  (''page in progress ...'')
  
  
- == Running the Pig Tutorials ==
+ == Pig Tutorials ==
- You can run the Pig tutorials in local mode or on a Hadoop cluster:
+ The Pig tutorial shows you how to run Pig scripts in local mode or on a 
Hadoop cluster.
+ 
-  * To run Pig in local mode, no Hadoop or DFS installation is required. All 
files are installed and run from your local host and file system.
+  * To run the scripts in local mode, no Hadoop or DFS installation is 
required. All files are installed and run from your local host and file system.
-  * To run Pig on a Hadoop cluster, you need access to a Hadoop cluster and 
DFS installation.
+  * To run the scripts on a Hadoop cluster, you need access to a Hadoop 
cluster and DFS installation.
- The Pig JAR file and the Pig tutorial file include everything you need to run 
in local mode or on a Hadoop cluster.
+ 
+ The Pig JAR file (pig.jar) and the Pig tutorial file include everything you 
need to get started.
  
  === Java Installation ===
- Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME 
environment variable to the root of your Java installation. 
+ Your run-time environment should include '''Java 1.5'''. 
+ Set the JAVA_HOME environment variable to the root of your Java installation. 
  
  
  === Pig Installation ===
  To install Pig, do the following:
  
-  1. Download the Pig JAR file (pig.jar) and move it to the appropriate 
directory. For example, /home/me/pig. 
+  1. Download the Pig JAR file ('''pig.jar''') and move it to the appropriate 
directory. For example, /home/me/pig. 
   1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
- === Running the Pig Tutorials in Local Mode ===
+ === Pig Scripts and Local Mode ===
- To run the Pig tutorial in local mode, do the following:
+ To install and run the Pig scripts in local mode, do the following:
  
   1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig):
@@ -37, +40 @@

  }}}
  
  
- === Running the Pig Tutorials on a Hadoop Cluster ===
+ === Pig Scripts and Hadoop Cluster ===
- To run the Pig tutorial on a Hadoop cluster, do the following:
+ To install and run the Pig scripts on a Hadoop cluster, do the following:
  
   1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Copy the excite.log file to your DFS directory.
@@ -78, +81 @@

  || !ToLower || Switches the query field to lowercase. ||
  || !TutorialUtil || Divides the query string into a set of words.||
  
+ 
+ == Pig Tutorial-1 ==
+ 
+ Pig Tutorial-1 (tutorial.pig or tutorial-local.pig) does the following:
+ 
+  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is a URL.
+  * Calls the !ToLower UDF to lower-case the query.
+  * Calls the !NonPornDetector UDF to remove pornographic query terms (defined 
by a list in a text file).
+  * Calls the !ExtractHour UDF to extract the hour from the time of the record.
+  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
+  * Calls the DISTINCT operator to get the unique n-grams for all records.
+  * Gets the count (occurrences) of each n-gram.
+  * Calls the !ScoreGenerator UDF to calculate a popularity score for the 
n-gram.
+  * Removes all records with a score less than or equal to 2.0.
+  * Sorts the remaining records by hour and score.
+  * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
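
The first part of this pipeline (clean, extract hour, generate n-grams, DISTINCT, count) can be sketched in plain Python. This is only an illustration of the data flow, not the tutorial script or its UDFs: the function names, the URL check, the YYMMDDHHMMSS time format, the n-gram length limit, and the choice to de-duplicate on (user, hour, n-gram) are all assumptions made for the example. The word-list filter and the popularity score are left out because the page does not describe how they work.

```python
from collections import Counter


def ngrams(query, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) of a query string."""
    words = query.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def hourly_ngram_counts(records):
    """Count occurrences of each (hour, n-gram) pair.

    records: iterable of (user, time, query) tuples, mirroring the
    user, time, and query fields of the "raw" bag.
    """
    seen = set()
    for user, time, query in records:
        query = query.strip().lower()            # lower-case the query
        # Assumed stand-in for the URL filter: drop empty or URL-like queries.
        if not query or query.startswith(("http://", "www.")):
            continue
        hour = time[6:8]                         # assumes YYMMDDHHMMSS timestamps
        for gram in ngrams(query):
            seen.add((user, hour, gram))         # set membership plays the role of DISTINCT
    return Counter((hour, gram) for _, hour, gram in seen)


records = [
    ("u1", "970916105432", "Pig Latin"),
    ("u1", "970916105500", "pig latin"),         # duplicate after lower-casing
    ("u2", "970916101010", "pig"),
    ("u3", "970916120000", "http://example.com"),
]
counts = hourly_ngram_counts(records)
```

With the sample records above, the duplicate query from u1 is counted once and the URL query is dropped, so only hour "10" appears in the result.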
+ 
+ == Pig Tutorial-2 ==
+ 
+ Pig Tutorial-2 (tutorial-join.pig or tutorial-join-local.pig) does the 
following:
+ 
+  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is a URL.
+  * Calls the !ToLower UDF to lower-case the query.
+  * Calls the !NonPornDetector UDF to remove pornographic query terms (defined 
by a list in a text file).
+  * Calls the !ExtractHour UDF to extract the hour from the time of the record.
+  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
+  * Calls the DISTINCT operator to get the unique n-grams for all records.
+  * Calls the GROUP operator to group the records by n-gram and hour.
+  * Calls the FOREACH operator to count the size of each group.
+  * Calls the FILTER operator to get the n-grams for hour “00”.
+  * Calls the FILTER operator to get the n-grams for hour “12”.
+  * Calls the JOIN operator to get the n-grams that appear in both hours 
“00” and “12”, along with their counts.
+  * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''count00''', '''count12'''
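
The FILTER and JOIN steps at the end of this list can be sketched in plain Python as an inner join on the n-gram. This is a hedged illustration rather than the tutorial script: the function name `join_hours`, the dictionary layout keyed by (hour, n-gram), and the sample counts are assumptions made for the example.

```python
def join_hours(counts, first="00", second="12"):
    """Keep the n-grams counted in both hours and pair up their counts.

    counts: dict mapping (hour, ngram) -> count, i.e. the result of
    grouping by n-gram and hour and counting each group.
    """
    # FILTER: split the grouped counts by hour.
    first_counts = {g: c for (h, g), c in counts.items() if h == first}
    second_counts = {g: c for (h, g), c in counts.items() if h == second}
    # JOIN: an inner join on the n-gram keeps only keys present on both sides.
    return {
        g: (first_counts[g], second_counts[g])
        for g in first_counts.keys() & second_counts.keys()
    }


sample = {
    ("00", "pig"): 3,
    ("12", "pig"): 5,
    ("00", "hadoop"): 2,   # only seen at hour 00, dropped by the join
    ("12", "latin"): 1,    # only seen at hour 12, dropped by the join
}
joined = join_hours(sample)
```

Here `joined` holds a single entry for "pig" with the pair of counts (3, 5), since it is the only n-gram present in both hours.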
+ 
