[Pig Wiki] Trivial Update of "RunPig" by CorinneC

Apache Wiki Tue, 03 Mar 2009 16:15:41 -0800

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/RunPig

------------------------------------------------------------------------------
- This page provides the information you need to get started running Pig.  You 
should have access to a Hadoop cluster and you should have Pig set up (see 
BuildPig ).
+ This page provides the information you need to get started running Pig.
  
- === Environment ===
+ === Run Modes ===
  
- First we need to set up a few things.
+ Pig has two run modes or exectypes, local and hadoop (currently called 
mapreduce). 
+   * '''Local Mode''': To run Pig in local mode, you need access to a single 
machine. 
+   * '''Hadoop (mapreduce) Mode''': To run Pig in hadoop (mapreduce) mode, you 
need access to a Hadoop cluster and HDFS installation.
  
- '''Unix''' and '''Windows''' users need to install and set up Java, including 
$JAVA_HOME.
+ To get a listing of all Pig commands, including the run modes, use: 
+ {{{
+ $ pig -help
+ }}}
  
- '''Windows''' users need to install Cygwin and the Perl package 
(http://www.cygwin.com/)
+ Note: A ticket has been entered to change {{{-x, -exectype local|mapreduce}}} 
 to  {{{-x, -exectype local|hadoop}}}
  
+ === Run Ways ===
- To set environment variables, use the right command for your shell (The 
examples use the bash flavor): 
-   * setenv PIGDIR /pig  (tcsh, csh) 
-   * export PIGDIR=/pig (bash, sh, ksh)
  
- In newer versions of Pig, you may also need to set some properties in the
- conf/pig.properties file (in the main pig directory). You may wish to set
- verbose=true until things are up and running.
+ You can run Pig three ways, using either local mode or hadoop (mapreduce) 
mode:
+   * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive 
shell, Grunt. 
+   * '''Script File''': Place Pig commands in a script file and run the script.
+   * '''Embedded Program''': Embed Pig commands in a host language and run the 
program.
+ 
+ Note: The script file mentioned above is a script that you create and which 
contains the Pig commands that you want to run using Pig (we provide a sample 
script in the next section). Note, however, that Pig itself is also a 
script (pig.sh), referred to below as "The Pig Script."
  
  === Sample Code ===
- 
  The sample code files you need to run the examples on this page include: 
    * Script file: attachment:id.pig 
    * Embedded program: attachment:idlocal.java and attachment:idhadoop.java
+  
+ The examples are based on these Pig commands, which extract all user IDs from 
the /etc/passwd file. 
  
- To start, we're going to parse a small text file, namely the /etc/passwd 
file.  (Don't worry -- for arcane reasons there are no passwords in the 
etc/passwd file, only user names and public info.) Copy the passwd file into 
your local directory: 
  {{{ 
- `cp /etc/passwd .`
+ A = load 'passwd' using PigStorage(':'); 
+ B = foreach A generate $0 as id;
+ dump B; 
+ store B into 'id.out';
  }}}
  
- Your file may look something like this. Fields are separated by colons (:).
  
+ === Environment ===
- {{{ 
- games:x:5:60:games:/usr/games:/bin/sh
- man:x:6:12:man:/var/cache/man:/bin/sh
- lp:x:7:7:lp:/var/spool/lpd:/bin/sh
- mail:x:8:8:mail:/var/mail:/bin/sh
- news:x:9:9:news:/var/spool/news:/bin/sh
- }}}
  
- Now, let's extract all usernames (the first column) from the file.  Here's 
the pig latin to do this:
-  
- {{{ 
- A = load 'passwd' using PigStorage(':'); -- load the text file; each ':' 
gives its own column.
- B = foreach A generate $0 as id;         -- for each in bag A, generate the 
$0 (first) element
- dump B;                                  -- preview it to the screen. A whole 
bunch of (names) will fly by.
- store B into 'id.out';                   -- store it into 'id.out'
- }}}
+ Unix and Windows users need to install and set up Java (including $JAVA_HOME).
+ 
+ Windows users need to install Cygwin and the Perl package 
(http://www.cygwin.com/)
+ 
+ To set environment variables, use the right command for your shell: 
+   * setenv PIGDIR /pig  (tcsh, csh) 
+   * export PIGDIR=/pig (bash, sh, ksh)
+ 
+ The examples use export.
  
  = Local Mode =
+ This section shows you how to run Pig in local mode, using the Grunt shell, a 
Pig script, and an embedded program. 
  
- This section shows you how to run Pig in local mode. 
- 
- To run Pig in local mode, you '''do not''' need access to a hadoop cluster - 
you only need access to a single machine. To make things simple, copy these 
files to your current working directory (you may want to create a temp 
directory and move to it):
+ To run Pig in local mode, you only need access to a single machine. To make 
things simple, copy these files to your current working directory (you may want 
to create a temp directory and move to it):
  
-   * The /etc/passwd file (again: this is just a handy file to parse, it 
doesn't configure anything in pig).
+   * The /etc/passwd file
    * The pig.jar file, created when you build Pig (see BuildPig)
-   * The sample code files (id.pig and idlocal.java) located above.
+   * The sample code files (id.pig and idlocal.java) located on this page
  
  === Grunt Shell ===
  To run Pig's Grunt shell in local mode, follow these instructions.
@@ -66, +68 @@

  $ export PIG_CLASSPATH=./pig.jar
  }}} 
  
- (1) With the Pig Runner shell file (bin/pig):
+ (1) With the Pig Script
  
  From your current working directory, run:
  {{{
- $ pig -x local -verbose
+ $ pig -x local
  }}}
- That last bit will request that pig log messages to the screen.  Once things 
are working well, remove the `-verbose` flag.
  
- The Grunt shell is invoked and you can enter commands at the prompt.  Type in 
each of the lines for our little column extracting script.
+ The Grunt shell is invoked and you can enter commands at the prompt.
  {{{
  grunt> A = load 'passwd' using PigStorage(':'); 
  grunt> B = foreach A generate $0 as id; 
  grunt> dump B; 
  }}}
  
+ (2) Without the Pig Script
- The last command should produce a nice long list of usernames, each in 
(parens):
- {{{
- (games)
- (man)
- (lp)
- (mail)
- (news)
- }}}
  
+ From your current working directory, run:
- 
- 
- === Run Ways ===
- 
- You can run Pig three ways, using either local mode or hadoop (mapreduce) 
mode:
-   * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive 
shell, Grunt. (That's what you just did.)
-   * '''Script File''': Place Pig commands in a script file and run the script.
-   * '''Embedded Program''': Embed Pig commands in a host language and run the 
program.
- 
- === Run Modes ===
- 
- Pig has two run modes or exectypes, local and hadoop (currently called 
mapreduce). 
-   * '''Local Mode''': To run Pig in local mode, you need access to a single 
machine. 
-   * '''Hadoop (mapreduce) Mode''': To run Pig in hadoop (mapreduce) mode, you 
need access to a Hadoop cluster and HDFS installation.
- 
- To get a listing of all Pig commands, including the run modes, use: 
  {{{
- $ pig -help
+ $ java -cp pig.jar org.apache.pig.Main -x local
+ Or
+ $ java -jar pig.jar -x local
  }}}
  
- Note (2008 Nov): A ticket has been entered to change {{{-x, -exectype 
local|mapreduce}}}  to  {{{-x, -exectype local|hadoop}}}
+ The Grunt shell is invoked and you can enter commands at the prompt.
  
  === Script File ===
  
  To run a Pig script file in local mode, follow these instructions (which are 
the same as the Grunt Shell instructions above; you just include the script 
file).
  
+ 
  First, point $PIG_CLASSPATH to the pig.jar file (in your current working 
directory):
  {{{
  $ export PIG_CLASSPATH=./pig.jar
  }}}
  
- (1) With the Pig Runner shell file (bin/pig)
+ (1) With the Pig Script
  
  From your current working directory, run:
  
@@ -131, +113 @@

  
  The results are displayed to your terminal screen.
  
- (2) Without the Pig Runner shell file
+ (2) Without the Pig Script
  
- All the pig script is doing is calling java on the pig.jar.  For identical 
results, run this from your current working directory:
+ From your current working directory, run:
  {{{
  $ java -cp pig.jar org.apache.pig.Main -x local id.pig
  Or
  $ java -jar pig.jar -x local id.pig
  }}}
  
+ The results are displayed to your terminal screen.
+ 
+ 
  === Embedded Program ===
  
  To compile and run an embedded Java/Pig program in local mode, follow these 
instructions. 
@@ -163, +148 @@

  
  This section shows you how to run Pig in hadoop (mapreduce) mode, using the 
Grunt shell, a Pig script, and an embedded program.
  
- To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. 
Your current working directory should still have the `passwd`, `pig.jar` and 
sample code scripts, just as above.
+ To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. 
You also need to copy these files to your home or current working directory.
  
+   * The /etc/passwd file
+   * The pig.jar file, created when you build Pig (see BuildPig)
+   * The sample code files (id.pig and idhadoop.java) located on this page
- You should also add the hadoop conf/ directory (the one with your 
hadoop-site.xml) to your environment:
- 
- {{{
- export PIGDIR=/path/to/your/pig-install
- export HADOOPDIR=/path/to/your/hadoop-install/conf
- export PIG_CLASSPATH=$PIGDIR/pig.jar:$HADOOPDIR
- }}}
- 
- In newer (SVN) versions of pig, you should also edit the 
$PIGDIR/conf/pig.properties file, and set 
`cluster=yournamenode.domain.com:port`.
  
  === Grunt Shell ===
- To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these 
instructions. If you are using HOD, Pig will allocate a 15-node cluster when 
you begin the session and deallocate the nodes when you leave it. 
+ To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these 
instructions. When you begin the session, Pig will allocate a 15-node cluster. 
When you quit the session, Pig will deallocate the nodes.
  
  From your current working directory, run:
  {{{
- $ pig -x mapreduce -verbose
+ $ pig
  }}}
  
+ The Grunt shell is invoked and you can enter commands at the prompt.
- You should see it first connect to the namenode:
- {{{
- 1    [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to 
hadoop file system at: hdfs://namenode.your.domain.org:9000
- }}}
- 
- If you see a line like this pig is not correctly finding your cluster.
- {{{
- 2008-12-02 20:53:02,983 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
- }}}
- 
- 
- The Grunt shell is invoked and you can enter commands at the prompt. Let's, 
you guessed it, extract the first column from the text file.  It will be much 
slower (due to the overhead) but way awesomer.
  {{{
  grunt> A = load 'passwd' using PigStorage(':'); 
  grunt> B = foreach A generate $0 as id; 
  grunt> dump B; 
  }}}
  
- Now save the file:
- {{{
- store B into 'id.out';
- }}}
  
- After a burst of activity, the file will emerge on the cluster.  To see it, 
from a shell prompt (not the pig prompt), run:
- {{{
- $ hadoop dfs -ls ./id.out
- -rw-r--r--   3 you supergroup        236 2008-12-02 21:15 
/user/you/id.out/map-000000
- -rw-r--r--   3 you supergroup          0 2008-12-02 21:15 
/user/you/id.out/part-00000
- -rw-r--r--   3 you supergroup          0 2008-12-02 21:15 
/user/you/id.out/part-00001
- ...
- }}}
- Now id.out is a directory, with as many files as there were reduces.  The 
file map-000000 (or whatever) holds the usernames:
- {{{
- $ hadoop dfs -cat ./id.out/map-000000
- games
- man
- lp
- mail
- news
- }}}
  
  === Script File ===
- Running a Pig script file in hadoop (mapreduce) mode is just like the above:
+ To run Pig script files in hadoop (mapreduce) mode, follow these instructions 
(which are the same as the Grunt Shell instructions above; you just include 
the script file). Again, Pig will automatically allocate and deallocate a 
15-node cluster.
+ 
+ From your current working directory, run:
  {{{
  $ pig id.pig
  }}}
+ 
+ The results are displayed to your terminal screen.
+ 
  
  === Embedded Program ===
  To compile and run an embedded Java/Pig program in hadoop (mapreduce) mode, 
follow these instructions. 
  
+ First, point $HADOOPDIR to the directory that contains the hadoop-site.xml 
file. Example:
+ {{{
+ $ export HADOOPDIR=/yourHADOOPsite/conf 
+ }}}
+ 
  From your current working directory, compile the program:
  {{{
  $ javac -cp pig.jar idhadoop.java
@@ -243, +200 @@

  From your current working directory, run the program:
  {{{
  Unix:   $ java -cp pig.jar:.:$HADOOPDIR idhadoop
- Cygwin: $ java -cp '.;pig.jar;$HADOOPDIR' idhadoop
+ Cygwin: $ java -cp ".;pig.jar;$HADOOPDIR" idhadoop
  }}}
  
+ To view the results, check the idout directory on your Hadoop system.
+
