[Pig Wiki] Update of "RunPig" by FlipKromer

Apache Wiki Tue, 02 Dec 2008 20:12:22 -0800

The following page has been changed by FlipKromer:
http://wiki.apache.org/pig/RunPig

The comment on the change is:
Pointed out a couple pitfalls; made it slightly more of a tutorial description

------------------------------------------------------------------------------
  This page provides the information you need to get started running Pig.
+ 
+ === Environment ===
+ 
+ First we need to set up a few things.
+ 
+ Unix and Windows users need to install and set up Java (including 
$JAVA_HOME). Use Sun Java 6 if at all possible.
+ 
+ Windows users need to install Cygwin and the Perl package 
(http://www.cygwin.com/)
+ 
+ To set environment variables, use the right command for your shell (the 
examples use the bash flavor): 
+   * setenv PIGDIR /pig  (tcsh, csh) 
+   * export PIGDIR=/pig (bash, sh, ksh)
+ 
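Before going further it can help to confirm that the variables mentioned above are actually set. The following is only a sketch in sh/bash: the `check_env` function name is made up for illustration, and `/pig` is the placeholder path from the list above, not a real install location.

```shell
# Hypothetical helper: report which of the variables named above are unset.
check_env() {
  missing=""
  for name in JAVA_HOME PIGDIR; do
    # Indirect lookup that works in plain sh: expand the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "unset:$missing"
    return 1
  fi
  echo "ok"
}

export PIGDIR=/pig   # placeholder path from the list above
check_env || echo "set the variables listed before running pig"
```

The check only tests that the variables are non-empty; it does not verify that they point at a working Java or Pig install.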
+ In newer versions of pig, you may also need to set some properties in the
+ conf/pig.properties file (in the main pig directory). You may wish to set
+ verbose=true until things are up and running.
+ 
+ === Sample Code ===
+ 
+ The sample code files you need to run the examples on this page include: 
+   * Script file: attachment:id.pig 
+   * Embedded program: attachment:idlocal.java and attachment:idhadoop.java
+ 
+ To start, we're going to parse a small text file, namely the /etc/passwd 
file. (Don't worry: for arcane reasons there are no passwords in the 
/etc/passwd file, only user names and public info. Windows users, just paste 
from the snippet below.) Copy that file into the local directory: 
`cp /etc/passwd .`
+ 
+ Yours may look something like this:
+ 
+ {{{ 
+ games:x:5:60:games:/usr/games:/bin/sh
+ man:x:6:12:man:/var/cache/man:/bin/sh
+ lp:x:7:7:lp:/var/spool/lpd:/bin/sh
+ mail:x:8:8:mail:/var/mail:/bin/sh
+ news:x:9:9:news:/var/spool/news:/bin/sh
+ }}}
+ 
+ Now, let's extract all usernames (the first column) from the file.  Here's 
the pig latin to do this:
+  
+ {{{ 
+ A = load 'passwd' using PigStorage(':'); -- load the file; split columns on ':'
+ B = foreach A generate $0 as id;         -- keep only $0, the first field
+ dump B;                                  -- print the (names) to the screen
+ store B into 'id.out';                   -- store the result in 'id.out'
+ }}}
+ 
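To know what the Pig script should produce before running it, the same first-column extraction can be sketched with the standard `cut` utility. This is a stand-in for comparison only, not how Pig works internally; the file names `passwd.sample` and `expected_ids.txt` are made up for this example, and the rows are the ones from the snippet above.

```shell
# Sample rows copied from the passwd snippet above, written to a scratch file.
cat > passwd.sample <<'EOF'
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
EOF

# Take the first ':'-separated field, the same column the Pig script picks with $0.
cut -d: -f1 passwd.sample > expected_ids.txt
cat expected_ids.txt
```

The five names it prints (games, man, lp, mail, news) are the ones `dump B` should show, there wrapped in parentheses.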
+ = Local Mode =
+ 
+ Start simple: Pig in local mode, using the Grunt shell. Later we'll look at 
running a Pig script, then an embedded program, then all of the above across a 
hadoop cluster. 
+ 
+ To run Pig in local mode, you only need access to a single machine. To make 
things simple, copy these files to your current working directory (you may want 
to create a temp directory and move to it):
+ 
+   * The /etc/passwd file (again: this is just a handy file to parse, it 
doesn't configure anything in pig).
+   * The pig.jar file, created when you build Pig (see BuildPig)
+   * The sample code files (id.pig and idlocal.java) located above.
+ 
+ === Grunt Shell ===
+ To run Pig's Grunt shell in local mode, follow these instructions.
+ 
+ First, point $PIG_CLASSPATH to the pig.jar file (in your current working 
directory):
+ {{{
+ $ export PIG_CLASSPATH=./pig.jar
+ }}} 
+ 
+ (1) With the Pig Runner shell file (bin/pig):
+ 
+ From your current working directory, run:
+ {{{
+ $ pig -x local -verbose
+ }}}
+ The `-verbose` flag asks Pig to log its messages to the screen.  Once things 
are working well, remove the `-verbose` flag.
+ 
+ The Grunt shell is invoked and you can enter commands at the prompt.  Type in 
each of the lines for our little column extracting script.
+ {{{
+ grunt> A = load 'passwd' using PigStorage(':'); 
+ grunt> B = foreach A generate $0 as id; 
+ grunt> dump B; 
+ }}}
+ 
+ The last command should produce a nice long list of usernames, each in 
(parens):
+ {{{
+ (games)
+ (man)
+ (lp)
+ (mail)
+ (news)
+ }}}
+ 
+ Awesome!
+ 
+ === Run Ways ===
+ 
+ You can run Pig three ways, using either local mode or hadoop (mapreduce) 
mode:
+   * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive 
shell, Grunt. (That's what you just did.)
+   * '''Script File''': Place Pig commands in a script file and run the script.
+   * '''Embedded Program''': Embed Pig commands in a host language and run the 
program.
  
  === Run Modes ===
  
@@ -10, +104 @@

  
  To get a listing of all Pig commands, including the run modes, use: 
  {{{
- $ pig –help
+ $ pig -help
  }}}
  
- Note: A ticket has been entered to change {{{-x, -exectype local|mapreduce}}} 
 to  {{{-x, -exectype local|hadoop}}}
+ Note (2008 Nov): A ticket has been entered to change {{{-x, -exectype 
local|mapreduce}}}  to  {{{-x, -exectype local|hadoop}}}
  
- === Run Ways ===
+ === Script File ===
  
+ To run a Pig script file in local mode, follow these instructions (which are 
the same as the Grunt Shell instructions above; you just include the script 
file).
- You can run Pig three ways, using either local mode or hadoop (mapreduce) 
mode:
-   * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive 
shell, Grunt. 
-   * '''Script File''': Place Pig commands in a script file and run the script.
-   * '''Embedded Program''': Embed Pig commands in a host language and run the 
program.
  
- Note: The script file mentioned above is a script that you create and which 
contains the Pig commands that you want to run using Pig (we provide a sample 
script in the next section). Please note, however, that Pig, itself, is also a 
script (pig.sh), and is referred below to as "The Pig Script."
+ First, point $PIG_CLASSPATH to the pig.jar file (in your current working 
directory):
+ {{{
+ $ export PIG_CLASSPATH=./pig.jar
+ }}}
  
+ (1) With the Pig Runner shell file (bin/pig)
+ 
+ From your current working directory, run:
- === Sample Code ===
- The sample code files you need to run the examples on this page include: 
-   * Script file: attachment:id.pig 
-   * Embedded program: attachment:idlocal.java and attachment:idhadoop.java
-  
- The examples are based on these Pig commands, which extract all user IDs from 
the /etc/passwd file. 
  
  {{{ 
+ $ pig -x local id.pig
- A = load 'passwd' using PigStorage(':'); 
- B = foreach A generate $0 as id;
- dump B; 
- store B into 'id.out';
  }}}
  
+ The results are displayed to your terminal screen.
  
- === Environment ===
+ (2) Without the Pig Runner shell file
  
- Unix and Windows users need to install and set up Java (including $JAVA_HOME).
+ All the pig script is doing is calling java on the pig.jar.  For identical 
results, run this from your current working directory:
+ {{{
+ $ java -cp pig.jar org.apache.pig.Main -x local id.pig
+ Or
+ $ java -jar pig.jar -x local id.pig
+ }}}
  
- Windows users need to install Cygwin and the Perl package 
(http://www.cygwin.com/)
+ === Embedded Program ===
  
+ To compile and run an embedded Java/Pig program in local mode, follow these 
instructions. 
- To set environment variables, use the right command for your shell: 
-   * setenv PIGDIR /pig  (tcsh, csh) 
-   * export PIGDIR=/pig (bash, sh, ksh)
  
- The examples use export.
+ From your current working directory, compile the program:
+ {{{
+ $ javac -cp pig.jar idlocal.java
+ }}}
  
- = Local Mode =
+ Note: idlocal.class is written to your current working directory. Include 
'.' in the class path when you run the program.
+ 
+ From your current working directory, run the program:
+ {{{
+ Unix:   $ java -cp pig.jar:. idlocal
+ Cygwin: $ java -cp '.;pig.jar' idlocal
+ }}}
+ 
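The only difference between the two commands above is the classpath separator: `:` on Unix and `;` under Cygwin. A small sketch of picking it automatically follows. The `classpath_sep` helper is hypothetical, and it assumes that `uname -s` reports a name starting with `CYGWIN` when run inside Cygwin.

```shell
# Hypothetical helper: choose the Java classpath separator for this platform.
classpath_sep() {
  case "$(uname -s)" in
    CYGWIN*) printf ';' ;;   # Windows JVM launched from a Cygwin shell
    *)       printf ':' ;;   # Unix-like systems
  esac
}

sep=$(classpath_sep)
CP="pig.jar${sep}."
# Print the command rather than run it, so the sketch works without pig.jar.
echo "java -cp \"$CP\" idlocal"
```

Quoting the classpath matters under Cygwin, because an unquoted `;` would end the shell command.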
+ To view the results, check the output file, id.out.
+ 
+ = Hadoop Mode =
+ 
- This section shows you how to run Pig in local mode, using the Grunt shell, a 
Pig script, and an embedded program. 
+ This section shows you how to run Pig in hadoop (mapreduce) mode, using the 
Grunt shell, a Pig script, and an embedded program.
  
- To run Pig in local mode, you only need access to a single machine. To make 
things simple, copy these files to your current working directory (you may want 
to create a temp directory and move to it):
+ To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. 
Your current working directory should still have the `passwd`, `pig.jar` and 
sample code scripts, just as above.
  
-   * The /etc/passwd file
-   * The pig.jar file, created when you build Pig (see BuildPig)
-   * The sample code files (id.pig and idlocal.java) located on this page
+ You should also add the hadoop conf/ directory (the one with your 
hadoop-site.xml) to your environment:
+ 
+ {{{
+ export PIGDIR=/path/to/your/pig-install
+ export HADOOPDIR=/path/to/your/hadoop-install/conf
+ export PIG_CLASSPATH=$PIGDIR/pig.jar:$HADOOPDIR
+ }}}
+ 
+ In newer (SVN) versions of pig, you should also edit the 
$PIGDIR/conf/pig.properties file, and set 
`cluster=yournamenode.domain.com:port`.
  
  === Grunt Shell ===
+ To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these 
instructions. If you are using HOD, Pig will allocate a 15-node cluster when 
you begin the session and deallocate the nodes when you leave it. 
- To run Pig's Grunt shell in local mode, follow these instructions.
- 
- First, point $PIG_CLASSPATH to the pig.jar file (in your current working 
directory):
- {{{
- $ export PIG_CLASSPATH=./pig.jar
- }}} 
- 
- (1) With the Pig Script
  
  From your current working directory, run:
  {{{
- $ pig -x local
+ $ pig -x mapreduce -verbose
  }}}
+ (in newer versions run `pig -x hadoop`)
  
- The Grunt shell is invoked and you can enter commands at the prompt.
+ You should see it first connect to the namenode:
+ {{{
+ 1    [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to 
hadoop file system at: hdfs://namenode.your.domain.org:9000
+ }}}
+ 
+ If you see a line like
+ {{{
+ 2008-12-02 20:53:02,983 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
+ }}}
+ then Pig is not finding your cluster and is using the local file system 
instead.
+ 
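If you have saved Pig's startup output to a file, the check described above can be scripted. This is a sketch under assumptions: the file name `pig-startup.log` and the `found_cluster` helper are made up for illustration, and the sample line is the namenode message shown above.

```shell
# A sample log line, copied from the namenode message shown above.
cat > pig-startup.log <<'EOF'
1    [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to hadoop file system at: hdfs://namenode.your.domain.org:9000
EOF

# Hypothetical helper: succeed only if the log shows an hdfs:// file system.
found_cluster() {
  grep -q 'Connecting to hadoop file system at: hdfs://' "$1"
}

if found_cluster pig-startup.log; then
  echo "cluster found"
else
  echo "no hdfs:// connection logged; check PIG_CLASSPATH and HADOOPDIR"
fi
```

With the sample line above the sketch prints "cluster found"; a log containing only the `file:///` message takes the other branch.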
+ The Grunt shell is invoked and you can enter commands at the prompt. Let's, 
you guessed it, extract the first column from the text file.  It will be much 
slower (due to the overhead) but way awesomer.
  {{{
  grunt> A = load 'passwd' using PigStorage(':'); 
  grunt> B = foreach A generate $0 as id; 
  grunt> dump B; 
  }}}
  
+ Now save the file:
- (2) Without the Pig Script
- 
- From your current working directory, run:
  {{{
+ store B into 'id.out';
- $ java -cp pig.jar org.apache.pig.Main -x local
- Or
- $ java -jar pig.jar -x local
  }}}
  
- The Grunt shell is invoked and you can enter commands at the prompt.
+ After a burst of activity, the file will emerge on the cluster.  To see it, 
from a shell prompt (not the pig prompt), run:
+ {{{
+ $ hadoop dfs -ls ./id.out
+ -rw-r--r--   3 you supergroup        236 2008-12-02 21:15 
/user/you/id.out/map-000000
+ -rw-r--r--   3 you supergroup          0 2008-12-02 21:15 
/user/you/id.out/part-00000
+ -rw-r--r--   3 you supergroup          0 2008-12-02 21:15 
/user/you/id.out/part-00001
+ ...
+ }}}
+ Now id.out is a directory, with as many files as there were reduces.  The 
file map-000000 (or whatever) holds the usernames:
+ {{{
+ $ hadoop dfs -cat ./id.out/map-000000
+ games
+ man
+ lp
+ mail
+ news
+ }}}
  
  === Script File ===
+ Running a Pig script file in hadoop (mapreduce) mode is just like the above:
- 
- To run a Pig script file in local mode, follow these instructions (which are 
the same as the Grunt Shell instructions above; you just include the script 
file).
- 
- 
- First, point $PIG_CLASSPATH to the pig.jar file (in your current working 
directory):
- {{{
- $ export PIG_CLASSPATH=./pig.jar
- }}}
- 
- (1) With the Pig Script
- 
- From your current working directory, run:
- 
- {{{ 
- $ pig -x local id.pig
- }}}
- 
- The results are displayed  to your terminal screen.
- 
- (2) Without the Pig Script
- 
- From your current working directory, run:
- {{{
- $ java -cp pig.jar org.apache.pig.Main -x local id.pig
- Or
- $ java -jar pig.jar -x local id.pig
- }}}
- 
- The results are displayed to your terminal screen.
- 
- 
- === Embedded Program ===
- 
- To compile and run an embedded Java/Pig program in local mode, follow these 
instructions. 
- 
- From your current working directory, compile the program:
- {{{
- $ javac -cp pig.jar idlocal.java
- }}}
- 
- Note: idlocal.class is written to your current working directory. Include 
'.' in the class path when you run the program.
- 
- From your current working directory, run the program:
- {{{
- Unix:   $ java -cp pig.jar:. idlocal
- Cygwin: $ java -cp '.;pig.jar' idlocal
- }}}
- 
- To view the results, check the output file, id.out.
- 
- = Hadoop Mode =
- 
- This section shows you how to run Pig in hadoop (mapreduce) mode, using the 
Grunt shell, a Pig script, and an embedded program.
- 
- To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. 
You also need to copy these files to your home or current working directory.
- 
-   * The /etc/passwd file
-   * The pig.jar file, created when you build Pig (see BuildPig)
-   * The sample code files (id.pig and idhadoop.java) located on this page
- 
- === Grunt Shell ===
- To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these 
instructions. When you begin the session, Pig will allocate a 15-node cluster. 
When you quit the session, Pig will deallocate the nodes.
- 
- From your current working directory, run:
- {{{
- $ pig
- }}}
- 
- The Grunt shell is invoked and you can enter commands at the prompt.
- {{{
- grunt> A = load 'passwd' using PigStorage(':'); 
- grunt> B = foreach A generate $0 as id; 
- grunt> dump B; 
- }}}
- 
- 
- 
- === Script File ===
- To run Pig script files in hadoop (mapreduce) mode, follow these instructions 
(which are the same as the Grunt Shell instructions above; you just include 
the script file). Again, Pig will automatically allocate and deallocate a 
15-node cluster.
- 
- From your current working directory, run:
  {{{
  $ pig id.pig
  }}}
- 
- The results are displayed  to your terminal screen.
- 
  
  === Embedded Program ===
  To compile and run an embedded Java/Pig program in hadoop (mapreduce) mode, 
follow these instructions. 
  
- First, point $HADOOPDIR to the directory that contains the hadoop-site.xml 
file. Example:
- {{{
- $ export HADOOPDIR=/yourHADOOPsite/conf 
- }}}
- 
  From your current working directory, compile the program:
  {{{
  $ javac -cp pig.jar idhadoop.java
@@ -200, +241 @@

  From your current working directory, run the program:
  {{{
  Unix:   $ java -cp pig.jar:.:$HADOOPDIR idhadoop
- Cygwin: $ java –cp ‘.;pig.jar;$HADOOPDIR’ idhadoop
+ Cygwin: $ java -cp ".;pig.jar;$HADOOPDIR" idhadoop
  }}}
  
- To view the results, check the idout directory on your Hadoop system.
-
