Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/RunPig
------------------------------------------------------------------------------
- This page provides the information you need to get started running Pig. You should have access to a Hadoop cluster and you should have Pig set up (see BuildPig).
+ This page provides the information you need to get started running Pig.
- === Environment ===
+ === Run Modes ===
- First we need to set up a few things.
+ Pig has two run modes or exectypes, local and hadoop (currently called mapreduce).
+ * '''Local Mode''': To run Pig in local mode, you need access to a single machine.
+ * '''Hadoop (mapreduce) Mode''': To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.
- '''Unix''' and '''Windows''' users need to install and set up Java, including $JAVA_HOME.
+ To get a listing of all Pig commands, including the run modes, use:
+ {{{
+ $ pig -help
+ }}}
- '''Windows''' users need to install Cygwin and the Perl package (http://www.cygwin.com/)
+ Note: A ticket has been entered to change {{{-x, -exectype local|mapreduce}}} to {{{-x, -exectype local|hadoop}}}
+ === Run Ways ===
- To set environment variables, use the right command for your shell (The examples use the bash flavor):
- * setenv PIGDIR /pig (tcsh, csh)
- * export PIGDIR=/pig (bash, sh, ksh)
- In newer versions of Pig, you may also need to set some properties in the
- conf/pig.properties file (in the main pig directory). You may wish to set
- verbose=true until things are up and running.
+ You can run Pig three ways - using either local mode or hadoop (mapreduce) mode:
+ * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive shell, Grunt.
+ * '''Script File''': Place Pig commands in a script file and run the script.
+ * '''Embedded Program''': Embed Pig commands in a host language and run the program.
+
+ Note: The script file mentioned above is a script that you create and which contains the Pig commands that you want to run using Pig (we provide a sample script in the next section). Please note, however, that Pig, itself, is also a script (pig.sh), and is referred to below as "The Pig Script."
  === Sample Code ===
- The sample code files you need to run the examples on this page include:
  * Script file: attachment:id.pig
  * Embedded program: attachment:idlocal.java and attachment:idhadoop.java
+
+ The examples are based on these Pig commands, which extract all user IDs from the /etc/passwd file.
- To start, we're going to parse a small text file, namely the /etc/passwd file. (Don't worry -- for arcane reasons there are no passwords in the etc/passwd file, only user names and public info.) Copy the passwd file into your local directory:
  {{{
- `cp /etc/passwd .`
+ A = load 'passwd' using PigStorage(':');
+ B = foreach A generate $0 as id;
+ dump B;
+ store B into 'id.out';
  }}}
- Your file may look something like this. Fields are separated by colons (:).
+ === Environment ===
- {{{
- games:x:5:60:games:/usr/games:/bin/sh
- man:x:6:12:man:/var/cache/man:/bin/sh
- lp:x:7:7:lp:/var/spool/lpd:/bin/sh
- mail:x:8:8:mail:/var/mail:/bin/sh
- news:x:9:9:news:/var/spool/news:/bin/sh
- }}}
- Now, let's extract all usernames (the first column) from the file. Here's the pig latin to do this:
-
- {{{
- A = load 'passwd' using PigStorage(':'); -- load the text file; each ':' gives its own column.
- B = foreach A generate $0 as id; -- for each in bag A, generate the $0 (first) element
- dump B; -- preview it to the screen. A whole bunch of (names) will fly by.
- store B into 'id.out'; -- store it into 'id.out'
- }}}
+ Unix and Windows users need to install and set up Java (including $JAVA_HOME).
+
+ Windows users need to install Cygwin and the Perl package (http://www.cygwin.com/)
+
+ To set environment variables, use the right command for your shell:
+ * setenv PIGDIR /pig (tcsh, csh)
+ * export PIGDIR=/pig (bash, sh, ksh)
+
+ The examples use export.
  = Local Mode =
+ This section shows you how to run Pig in local mode, using the Grunt shell, a Pig script, and an embedded program.
- This section shows you how to run Pig in local mode.
-
- To run Pig in local mode, you '''do not''' need access to a hadoop cluster - you only need access to a single machine. To make things simple, copy these files to your current working directory (you may want to create a temp directory and move to it):
+ To run Pig in local mode, you only need access to a single machine. To make things simple, copy these files to your current working directory (you may want to create a temp directory and move to it):
- * The /etc/passwd file (again: this is just a handy file to parse, it doesn't configure anything in pig).
+ * The /etc/passwd file
  * The pig.jar file, created when you build Pig (see BuildPig)
- * The sample code files (id.pig and idlocal.java) located above.
+ * The sample code files (id.pig and idlocal.java) located on this page
  === Grunt Shell ===
  To run Pig's Grunt shell in local mode, follow these instructions.
@@ -66, +68 @@
  $ export PIG_CLASSPATH=./pig.jar
  }}}
- (1) With the Pig Runner shell file (bin/pig):
+ (1) With the Pig Script
  From your current working directory, run:
  {{{
- $ pig -x local -verbose
+ $ pig -x local
  }}}
- That last bit will request that pig log messages to the screen. Once things are working well, remove the `-verbose` flag.
- The Grunt shell is invoked and you can enter commands at the prompt. Type in each of the lines for our little column extracting script.
+ The Grunt shell is invoked and you can enter commands at the prompt.
  {{{
  grunt> A = load 'passwd' using PigStorage(':');
  grunt> B = foreach A generate $0 as id;
  grunt> dump B;
  }}}
+ (2) Without the Pig Script
- The last command should produce a nice long list of usernames, each in (parens):
- {{{
- (games)
- (man)
- (lp)
- (mail)
- (news)
- }}}
+ From your current working directory, run:
-
-
- === Run Ways ===
-
- You can run Pig three ways - using either local mode or hadoop (mapreduce) mode:
- * '''Grunt Shell''': Enter Pig commands manually using Pig's interactive shell, Grunt. (That's what you just did.)
- * '''Script File''': Place Pig commands in a script file and run the script.
- * '''Embedded Program''': Embed Pig commands in a host language and run the program.
-
- === Run Modes ===
-
- Pig has two run modes or exectypes, local and hadoop (currently called mapreduce).
- * '''Local Mode''': To run Pig in local mode, you need access to a single machine.
- * '''Hadoop (mapreduce) Mode''': To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.
-
- To get a listing of all Pig commands, including the run modes, use:
  {{{
- $ pig -help
+ $ java -cp pig.jar org.apache.pig.Main -x local
+ Or
+ $ java -jar pig.jar -x local
  }}}
- Note (2008 Nov): A ticket has been entered to change {{{-x, -exectype local|mapreduce}}} to {{{-x, -exectype local|hadoop}}}
+ The Grunt shell is invoked and you can enter commands at the prompt.
  === Script File ===
  To run a Pig script file in local mode, follow these instructions (which are the same as the Grunt Shell instructions above - you just include the script file).
+ First, point $PIG_CLASSPATH to the pig.jar file (in your current working directory):
  {{{
  $ export PIG_CLASSPATH=./pig.jar
  }}}
- (1) With the Pig Runner shell file (bin/pig)
+ (1) With the Pig Script
  From your current working directory, run:
@@ -131, +113 @@
  The results are displayed to your terminal screen.
- (2) Without the Pig Runner shell file
+ (2) Without the Pig Script
- All the pig script is doing is calling java on the pig.jar. For identical results, run this from your current working directory:
+ From your current working directory, run:
  {{{
  $ java -cp pig.jar org.apache.pig.Main -x local id.pig
  Or
  $ java -jar pig.jar -x local id.pig
  }}}
+ The results are displayed to your terminal screen.
+
+ === Embedded Program ===
  To compile and run an embedded Java/Pig program in local mode, follow these instructions.
@@ -163, +148 @@
  This section shows you how to run Pig in hadoop (mapreduce) mode, using the Grunt shell, a Pig script, and an embedded program.
- To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. Your current working directory should still have the `passwd`, `pig.jar` and sample code scripts, just as above.
+ To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster. You also need to copy these files to your home or current working directory.
+ * The /etc/passwd file
+ * The pig.jar file, created when you build Pig (see BuildPig)
+ * The sample code files (id.pig and idhadoop.java) located on this page
- You should also add the hadoop conf/ directory (the one with your hadoop-site.xml) to your environment:
-
- {{{
- export PIGDIR=/path/to/your/pig-install
- export HADOOPDIR=/path/to/your/hadoop-install/conf
- export PIG_CLASSPATH=$PIGDIR/pig.jar:$HADOOPDIR
- }}}
-
- In newer (SVN) versions of pig, you should also edit the $PIGDIR/conf/pig.properties file, and set `cluster=yournamenode.domain.com:port`.
  === Grunt Shell ===
- To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these instructions. If you are using HOD, Pig will allocate a 15-node cluster when you begin the session and deallocate the nodes when you leave it.
+ To run Pig's Grunt shell in hadoop (mapreduce) mode, follow these instructions. When you begin the session, Pig will allocate a 15-node cluster.
When you quit the session, Pig will deallocate the nodes.
  From your current working directory, run:
  {{{
- $ pig -x mapreduce -verbose
+ $ pig
  }}}
+ The Grunt shell is invoked and you can enter commands at the prompt.
- You should see it first connect to the namenode:
- {{{
- 1 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.your.domain.org:9000
- }}}
-
- If you see a line like this pig is not correctly finding your cluster.
- {{{
- 2008-12-02 20:53:02,983 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
- }}}
-
-
- The Grunt shell is invoked and you can enter commands at the prompt. Let's, you guessed it, extract the first column from the text file. It will be much slower (due to the overhead) but way awesomer.
  {{{
  grunt> A = load 'passwd' using PigStorage(':');
  grunt> B = foreach A generate $0 as id;
  grunt> dump B;
  }}}
- Now save the file:
- {{{
- store B into 'id.out';
- }}}
- After a burst of activity, the file will emerge on the cluster. To see it, from a shell prompt (not the pig prompt), run:
- {{{
- $ hadoop dfs -ls ./id.out
- -rw-r--r-- 3 you supergroup 236 2008-12-02 21:15 /user/you/id.out/map-000000
- -rw-r--r-- 3 you supergroup 0 2008-12-02 21:15 /user/you/id.out/part-00000
- -rw-r--r-- 3 you supergroup 0 2008-12-02 21:15 /user/you/id.out/part-00001
- ...
- }}}
- Now id.out is a directory, with as many files as there were reduces. The file map-000000 (or whatever) holds the usernames:
- {{{
- $ hadoop dfs -cat ./id.out/map-000000
- games
- man
- lp
- mail
- news
- }}}
  === Script File ===
- Running a Pig script file in hadoop (mapreduce) mode is just like the above:
+ To run Pig script files in hadoop (mapreduce) mode, follow these instructions (which are the same as the Grunt Shell instructions above - you just include the script file).
Again, Pig will automatically allocate and deallocate a 15-node cluster.
+
+ From your current working directory, run:
  {{{
  $ pig id.pig
  }}}
+
+ The results are displayed to your terminal screen.
+ === Embedded Program ===
  To compile and run an embedded Java/Pig program in hadoop (mapreduce) mode, follow these instructions.
+ First, point $HADOOPDIR to the directory that contains the hadoop-site.xml file. Example:
+ {{{
+ $ export HADOOPDIR=/yourHADOOPsite/conf
+ }}}
+ From your current working directory, compile the program:
  {{{
  $ javac -cp pig.jar idhadoop.java
@@ -243, +200 @@
  From your current working directory, run the program:
  {{{
  Unix: $ java -cp pig.jar:.:$HADOOPDIR idhadoop
- Cygwin: $ java -cp '.;pig.jar;$HADOOPDIR' idhadoop
+ Cygwin: $ java -cp ".;pig.jar;$HADOOPDIR" idhadoop
  }}}
+ To view the results, check the idout directory on your Hadoop system.
+
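
As a rough illustration of what the sample Pig commands compute (split each passwd line on ':' and keep the first field), here is a plain-Java sketch. It does not use Pig or pig.jar and is not the attached idlocal.java or idhadoop.java; the class name IdSketch and the method extractIds are hypothetical names chosen for this example, and it makes no attempt to reproduce how Pig handles malformed or empty records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IdSketch {
    // Hypothetical helper, roughly mirroring:
    //   A = load 'passwd' using PigStorage(':');
    //   B = foreach A generate $0 as id;
    static List<String> extractIds(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            // Limit -1 keeps trailing empty fields; index 0 is the first column.
            ids.add(line.split(":", -1)[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two passwd-style sample lines.
        List<String> sample = Arrays.asList(
            "games:x:5:60:games:/usr/games:/bin/sh",
            "man:x:6:12:man:/var/cache/man:/bin/sh");
        // Print each id in parentheses, similar in spirit to 'dump B'.
        for (String id : extractIds(sample)) {
            System.out.println("(" + id + ")");
        }
    }
}
```

Compiling with javac IdSketch.java and running java IdSketch should print one parenthesized id per input line; the real Pig examples above remain the way to run the extraction in local or hadoop (mapreduce) mode.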