Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/FAQ

------------------------------------------------------------------------------
- '''1. I'm using `PigStorage` to parse my input files. Can I make it use control characters as delimiters?'''
+ '''Q: How can I load data using Unicode control characters as delimiters?'''
- Yes. The first parameter to `PigStorage` is the dataset name, the second is a regular expression to describe the delimiter. We used `String.split(regex, -1)` to extract fields from lines. See [http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html java.util.regex.Pattern] for more information on the way to use special characters in regex. For example,
+ The first parameter to !PigStorage is the dataset name, the second is a regular expression to describe the delimiter. We used `String.split(regex, -1)` to extract fields from lines. See [http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html java.util.regex.Pattern] for more information on the way to use special characters in regex.
+
+ If you are loading a file which contains Ctrl+A as separators, you can specify this to !PigStorage using the Unicode notation.
  {{{
- LOAD 'input.dat' USING PigStorage('\u0001');
+ LOAD 'input.dat' USING PigStorage('\u0001')as (x,y,z);
  }}}
- will use `^A` as a delimiter.
+ '''Q: How do I make my jobs run on multiple machines?'''
- '''2. Can I do a numerical comparison while filtering?'''
-
- Yes, you can choose between numerical and string comparison. For numerical comparison use the operators =, <>, < etc. and for string comparisons use eq, neq etc. See the format of [#CondS Conditions].
-
- '''3. How do I make my jobs run on multiple machines?'''
-
- Use the `PARALLEL` clause:
+ Use the PARALLEL clause:
  {{{
  C = JOIN A by url, B by url PARALLEL 50;
  }}}
- '''4. I would like to use Pig to read a list of `.gz` files that use `'\u0001'` as a delimiter. How do I do that?'''
+ '''Q: How do I make my Pig jobs run on a specified number of reducers?'''
+ You can achieve this with the PARALLEL clause. For example:
- You can use the following load command:
-
  {{{
- LOAD 'input_file' USING PigStorage('\u0001');
+ C = JOIN A by url, B by url PARALLEL 50.
  }}}
- '''5. Does Pig support NULLs?'''
+ Even if you do not specify the parallel clause, the framework uses a default number of reducers, in the order of 0.9*(number of nodes allocated by user -1)*n where n is the number of maximum reduce slots, for running your M/R jobs which result from statements such as GROUP, COGROUP, JOIN, and ORDER BY. For example, when allocating 3 machines you get about 0.9*2*4 = 7 reducers for operating on your parallel jobs.
- Pig currently has no support for NULL values but it is on the roadmap.
+ '''Q: Can I do a numerical comparison while filtering?'''
+ Yes, you can choose between numerical and string comparison. For numerical comparison use the operators =, <>, < etc. and for string comparisons use eq, neq etc. See the format of [#CondS Conditions].
+
+
+
+
- '''6. Does Pig support regular expressions?'''
+ '''Q: Does Pig support regular expressions?'''
  Pig does support regular expression matching via the `matches` keyword. It uses [http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html java.util.regex] matches which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"` not `"fred"`).
- '''7. How do I prevent failure if some records don't have the needed number of columns?'''
+ '''Q: How do I prevent failure if some records don't have the needed number of columns?'''
  You can filter away those records by including the following in your Pig program:
@@ -50, +50 @@
  This code would drop all records that have fewer than five (5) columns.
- '''8. Is there any difference between `==` and `eq` for numeric comparisons?'''
+ '''Q: Is there any difference between `==` and `eq` for numeric comparisons?'''
  There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
- '''9. Is it possible to use PIG with a regular Hadoop cluster (not HOD) ?'''
+ '''Q: Is it possible to use PIG with a regular Hadoop cluster (not HOD)?'''
  You can set this property using the empty string.
@@ -62, +62 @@
  hod.server=""
  }}}
- '''10. Is there an easy way for me to figure out how many rows exist in a dataset from it's alias?'''
+ '''Q: Is there an easy way for me to figure out how many rows exist in a dataset from it's alias?'''
- You can run the following set of commands:
+ You can run the following set of commands, which are equivalent to `SELECT COUNT(*)` in SQL:
  {{{
- a = LOAD 'bla' ... ;
+ a = LOAD 'mytestfile.txt';
  b = GROUP a ALL;
  c = FOREACH b GENERATE COUNT(a.$0);
  }}}
- This is equivalent to `SELECT COUNT(*)` in SQL.
- '''11. Does Pig allow grouping on expressions?'''
+ '''Q: Does Pig allow grouping on expressions?'''
- Currently, Pig only allows grouping on data fields rather than expressions. Allowing grouping on expressions is on our roadmap. Stay tuned!
+ Pig allows grouping of expressions. For example:
- '''12. Is there a way to check if a map is empty?'''
+ {{{
+ grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
+ grunt> DUMP a;
+ (1,2,3)
+ (4,2,1)
+ (4,3,4)
+ (4,3,4)
+ (7,2,5)
+ (8,4,3)
- Currently, there is no way to do that.
+ b = GROUP a BY (x+y);
+ (3.0,{(1,2,3)})
+ (6.0,{(4,2,1)})
+ (7.0,{(4,3,4),(4,3,4)})
+ (9.0,{(7,2,5)})
+ (12.0,{(8,4,3)})
+ }}}
+ If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.
+ {{{
+ grunt> b = GROUP a BY 4;
+ (4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
+ }}}
+ '''Q: Is there a way to check if a map is empty?'''
+
+ In Pig 2.0 you can test the existence of values in a map using the null construct:
+ m#'key' is not null
+
- '''13. How can I specify the number of nodes Pig allocates?'''
+ '''Q: How can I specify the number of nodes Pig allocates?'''
  {{{
  > pig -Dhod.param='-m 3' my_script.pig
@@ -90, +113 @@
  Three (3) nodes is the minimum.
- '''14. How can I load data using `PigStorage()` that requires Unicode specification for separators?'''
+ '''Q: How can I ask Pig to use an already allocated HOD cluster?'''
- Old version of Pig using `'\t'`:
+ Suppose you allocated a cluster:
+ {{{
+ $ mkdir -p ~/hod-clusters/test
+ $ hod allocate -d ~/hod-clusters/test -n 5
+ $ setenv CLUSTERDIR ~/hod-clusters/test
+ }}}
+
+ You can then use the following command, using either -Dhod.server="" or -Dhod.server=''
+ {{{
+ $ pig -cp $CLUSTERDIR -Dhod.server='' myscript.pig
+ }}}
+
- {{{
- a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');
- }}}
-
- New version of Pig using Unicode:
-
- {{{
- a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0000B');
- }}}
-
