Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/FAQ

------------------------------------------------------------------------------
- Pig FAQ
+ '''1. I'm using `PigStorage` to parse my input files. Can I make it use control characters as delimiters?'''
- 1. I'm using PigStorage to parse my input files. Can I make it use control characters as delimiters?
+ Yes. The first parameter to `PigStorage` is the dataset name, the second is a regular expression to describe the delimiter. We used `String.split(regex, -1)` to extract fields from lines. See [http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html java.util.regex.Pattern] for more information on the way to use special characters in regex. For example,
- A. Yes. Examples: PigStorage('\u0001') for Ctrl+A or '\u007C' for this character: |
+ {{{
+ LOAD 'input.dat' USING PigStorage('\u0001');
+ }}}
- 2. Can I do a numerical comparison while filtering?
+ will use `^A` as a delimiter.
- A. Yes, you can choose between numerical and string comparison. For numerical comparison use the operators =, <>, < etc. and for string comparisons use eq, neq etc.
+ '''2. Can I do a numerical comparison while filtering?'''
- 3. How do I make my jobs run on multiple machines?
+ Yes, you can choose between numerical and string comparison. For numerical comparison use the operators =, <>, < etc. and for string comparisons use eq, neq etc. See the format of [#CondS Conditions].
- A. Use the PARALLEL clause. For example =C = JOIN A by url, B by url PARALLEL 50=
+ '''3. How do I make my jobs run on multiple machines?'''
- 4. Does Pig support NULLs?
+ Use the `PARALLEL` clause:
- A. Pig currently has no support for NULL values but it is on the roadmap.
+ {{{
+ C = JOIN A by url, B by url PARALLEL 50;
+ }}}
- 5. Does pig support regular expressions?
+ '''4. I would like to use Pig to read a list of `.gz` files that use `'\u0001'` as a delimiter. How do I do that?'''
- A. Pig does support regular expression matching via =matches= keyward. Tt uses java.util.regexp matches which means your pattern has to match the entire string (ie if your string is "hi fred" and you want to find "fred" you have to give a pattern of ".*fred" not "fred").
+ You can use the following load command:
+ {{{
+ LOAD 'input_file' USING PigStorage('\u0001');
+ }}}
+
+ '''5. Does Pig support NULLs?'''
+
+ Pig currently has no support for NULL values but it is on the roadmap.
+
+ '''6. Does Pig support regular expressions?'''
+
+ Pig does support regular expression matching via the `matches` keyword. It uses [http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html java.util.regex] matches which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"` not `"fred"`).
+
- 6. How to prevent failure if some records don't have the needed number of columns.
+ '''7. How do I prevent failure if some records don't have the needed number of columns?'''
  You can filter away those records by including the following in your Pig program:
-
+ {{{
- A = load 'foo' using PigStorage('\t');
+ A = LOAD 'foo' USING PigStorage('\t');
  B = FILTER A BY ARITY(*) < 5;
  .....
+ }}}
+ This code would drop all records that have fewer than five (5) columns.
- This code would drop all the records that has less than 5 columns.
+ '''8. Is there any difference between `==` and `eq` for numeric comparisons?'''
- 7. Is there any difference between == and eq for numeric comparisons?
+ There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
- For equality, there is no difference while you stay in integers. However 11.0 and 11 will be equal with == but not with eq.
+ '''9. Is it possible to use PIG with a regular Hadoop cluster (not HOD) ?'''
+ You can set this property using the empty string.
+
+ {{{
+ hod.server=""
+ }}}
+
- 8. Is there an easy way for me to figure out how many rows exists in a dataset from its alias?
+ '''10. Is there an easy way for me to figure out how many rows exist in a dataset from it's alias?'''
  You can run the following set of commands:
+ {{{
+ a = LOAD 'bla' ... ;
+ b = GROUP a ALL;
+ c = FOREACH b GENERATE COUNT(a.$0);
+ }}}
- a = load 'bla' ... ;
+ This is equivalent to `SELECT COUNT(*)` in SQL.
- b = group a all;
+ '''11. Does Pig allow grouping on expressions?'''
- c = foreach b generate COUNT(a.$0);
+ Currently, Pig only allows grouping on data fields rather than expressions. Allowing grouping on expressions is on our roadmap. Stay tuned!
+ '''12. Is there a way to check if a map is empty?'''
- This is equivalent to select count(*) in SQL.
+ Currently, there is no way to do that.
- 9. Does Pig allow grouping on expressions
+ '''13. How can I specify the number of nodes Pig allocates?'''
- Currently, Pig only allows to group on data fields rather than expressions. Allowing grouping on expressions is on our road map. Stay tuned!
+ {{{
+ > pig -Dhod.param='-m 3' my_script.pig
+ }}}
+ Three (3) nodes is the minimum.
+
+ '''14. How can I load data using `PigStorage()` that requires Unicode specification for separators?'''
+
+ Old version of Pig using `'\t'`:
+
+ {{{
+ a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');
+ }}}
+
+ New version of Pig using Unicode:
+
+ {{{
+ a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0000B');
+ }}}
+
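A note on the two Java behaviours the FAQ answers rely on (splitting with `String.split(regex, -1)` in question 1, and whole-string matching with `java.util.regex` in the regular-expression question): the sketch below exercises them in plain Java. It is not Pig code. The class and method names (`PigFaqRegexDemo`, `splitFields`, `fullMatch`) are hypothetical and exist only for illustration, and the assumption that Pig passes the delimiter straight through as a regex comes from the FAQ text itself.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; only the java.util / java.util.regex calls are real APIs.
final class PigFaqRegexDemo {

    // Splits a line the way the FAQ describes: String.split(regex, -1).
    // The negative limit keeps trailing empty fields instead of dropping them.
    public static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    // Pattern.matches succeeds only if the pattern covers the entire input,
    // which is the behaviour the FAQ attributes to the matches keyword.
    public static boolean fullMatch(String value, String pattern) {
        return Pattern.matches(pattern, value);
    }

    public static void main(String[] args) {
        // Two fields separated by Ctrl+A (U+0001), plus a trailing empty field.
        String line = "a" + '\u0001' + "b" + '\u0001';
        String[] fields = splitFields(line, "\\u0001");
        System.out.println(fields.length);            // 3
        System.out.println(Arrays.toString(fields));

        System.out.println(fullMatch("hi fred", "fred"));    // false
        System.out.println(fullMatch("hi fred", ".*fred"));  // true
    }
}
```

With a limit of `0` instead of `-1`, `split` would silently drop the trailing empty field, which matters when counting columns per record.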
