[Pig Wiki] Trivial Update of "FAQ" by CorinneC

Apache Wiki Mon, 09 Mar 2009 10:58:00 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by CorinneC:
http://wiki.apache.org/pig/FAQ

------------------------------------------------------------------------------
- '''1. I'm using `PigStorage` to parse my input files. Can I make it use 
control characters as delimiters?''' 
+ '''Q: How can I load data using Unicode control characters as delimiters?''' 
  
- Yes. The first parameter to `PigStorage` is the dataset name, the second is a 
regular expression to describe the delimiter. We used `String.split(regex, -1)` 
to extract fields from lines. See 
[http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html 
java.util.regex.Pattern] for more information on the way to use special 
characters in regex. For example,
+ The first parameter to !PigStorage is the dataset name; the second is a 
regular expression that describes the delimiter. We used `String.split(regex, -1)` 
to extract fields from lines. See 
[http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html 
java.util.regex.Pattern] for more information on how to use special 
characters in a regex. 
+ 
+ If you are loading a file that uses Ctrl+A as the separator, you can 
specify it to !PigStorage in Unicode notation.
  
  {{{
- LOAD 'input.dat' USING PigStorage('\u0001');
+ LOAD 'input.dat' USING PigStorage('\u0001') as (x,y,z);
  }}}
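
As an illustration only (this is not Pig's implementation), the following Python sketch mimics how a line delimited by Ctrl+A (U+0001) breaks into fields. It assumes Python's `str.split` as a stand-in for the Java `String.split(regex, -1)` call mentioned above; `split_fields` and the sample record are hypothetical.

```python
from typing import List


def split_fields(line: str, delimiter: str = "\u0001") -> List[str]:
    """Split one input line into fields on the given delimiter.

    Like Java's String.split(regex, -1), str.split with an explicit
    separator keeps empty fields, including trailing ones.
    """
    return line.rstrip("\n").split(delimiter)


if __name__ == "__main__":
    # Hypothetical record: three Ctrl+A separators, last two fields empty.
    record = "alice\u000142\u0001\u0001\n"
    print(split_fields(record))
```

Empty trailing fields are kept rather than dropped, so every record yields the same number of columns as long as it contains the same number of delimiters.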
  
- will use `^A` as a delimiter.
+ '''Q: How do I make my jobs run on multiple machines?'''
  
- '''2. Can I do a numerical comparison while filtering?'''
- 
- Yes, you can choose between numerical and string comparison. For numerical 
comparison use the operators =, <>, <  etc. and for string comparisons use eq, 
neq etc. See the format of [#CondS Conditions].
- 
- '''3. How do I make my jobs run on multiple machines?'''
- 
- Use the `PARALLEL` clause:
+ Use the PARALLEL clause:
  
  {{{
  C = JOIN A by url, B by url PARALLEL 50;
  }}}
  
- '''4. I would like to use Pig to read a list of `.gz` files that use 
`'\u0001'` as a delimiter. How do I do that?'''
+ '''Q: How do I make my Pig jobs run on a specified number of reducers?'''
  
+ You can achieve this with the PARALLEL clause. For example: 
- You can use the following load command:
- 
  {{{
- LOAD 'input_file' USING PigStorage('\u0001');
+ C = JOIN A by url, B by url PARALLEL 50;
  }}}
  
- '''5. Does Pig support NULLs?'''
+ Even if you do not specify the PARALLEL clause, the framework uses a default 
number of reducers for the M/R jobs that result from statements such as GROUP, 
COGROUP, JOIN, and ORDER BY. The default is on the order of 0.9*(number of 
nodes allocated by the user - 1)*n, where n is the maximum number of reduce 
slots. For example, when allocating 3 machines with n = 4, you get about 
0.9*2*4 = 7.2, or roughly 7 reducers. 
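
The arithmetic above can be checked with a short Python sketch. It only evaluates the rule of thumb quoted in this answer. The function name `default_reducers` is hypothetical, and how the framework rounds the result is an assumption not covered here.

```python
def default_reducers(nodes: int, max_reduce_slots: int) -> float:
    """Estimate from the FAQ: 0.9 * (allocated nodes - 1) * n."""
    return 0.9 * (nodes - 1) * max_reduce_slots


if __name__ == "__main__":
    # The FAQ's example: 3 allocated machines, n = 4 reduce slots.
    estimate = default_reducers(3, 4)
    print(f"estimated reducers: {estimate:.1f}")
```

For 3 machines and n = 4 this gives 7.2, matching the "about 7" in the example.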
  
- Pig currently has no support for NULL values but it is on the roadmap.
+ '''Q: Can I do a numerical comparison while filtering?'''
  
+ Yes, you can choose between numerical and string comparison. For numerical 
comparison use the operators =, <>, <  etc. and for string comparisons use eq, 
neq etc. See the format of [#CondS Conditions].
+ 
+ 
+ 
+ 
- '''6. Does Pig support regular expressions?'''
+ '''Q: Does Pig support regular expressions?'''
  
  Pig does support regular expression matching via the `matches` keyword. It 
uses 
[http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html 
java.util.regex] matches which means your pattern has to match the entire 
string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you 
have to give a pattern of `".*fred"` not `"fred"`).
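
The whole-string behaviour can be illustrated with a Python analog. This is a sketch, not Pig code: it assumes `re.fullmatch` as the counterpart of the `java.util.regex` `matches` semantics described above, and the helper name `matches` is hypothetical.

```python
import re


def matches(value: str, pattern: str) -> bool:
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    print(matches("hi fred", "fred"))    # pattern covers only part of the string
    print(matches("hi fred", ".*fred"))  # pattern covers the whole string
```

The first call returns `False` and the second returns `True`, because only `".*fred"` accounts for the leading `"hi "`.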
  
- '''7. How do I prevent failure if some records don't have the needed number 
of columns?'''
+ '''Q: How do I prevent failure if some records don't have the needed number 
of columns?'''
  
  You can filter away those records by including the following in your Pig 
program:
  
@@ -50, +50 @@

  
  This code would drop all records that have fewer than five (5) columns.
  
- '''8. Is there any difference between `==` and `eq` for numeric 
comparisons?'''
+ '''Q: Is there any difference between `==` and `eq` for numeric 
comparisons?'''
  
  There is no difference when using integers. However, `11.0` and `11` will be 
equal with `==` but not with `eq`. 
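
A Python sketch of the distinction, assuming `==` compares values numerically and `eq` compares their string forms, as this answer describes. The helper names are hypothetical and this is not Pig's implementation.

```python
def numeric_equal(a: str, b: str) -> bool:
    """Compare two field values as numbers (analogous to ==)."""
    return float(a) == float(b)


def string_equal(a: str, b: str) -> bool:
    """Compare two field values as raw strings (analogous to eq)."""
    return a == b


if __name__ == "__main__":
    print(numeric_equal("11.0", "11"))  # same number
    print(string_equal("11.0", "11"))   # different strings
```

The numeric comparison reports the values as equal, while the string comparison does not.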
  
- '''9. Is it possible to use PIG with a regular Hadoop cluster (not HOD) ?'''
+ '''Q: Is it possible to use PIG with a regular Hadoop cluster (not HOD)?'''
  
  You can set this property using the empty string.
  
@@ -62, +62 @@

  hod.server=""
  }}}
  
- '''10. Is there an easy way for me to figure out how many rows exist in a 
dataset from it's alias?'''
+ '''Q: Is there an easy way for me to figure out how many rows exist in a 
dataset from its alias?'''
  
- You can run the following set of commands:
+ You can run the following set of commands, which are equivalent to `SELECT 
COUNT(*)` in SQL:
  
  {{{
- a = LOAD 'bla' ... ;
+ a = LOAD 'mytestfile.txt';
  b = GROUP a ALL;
  c = FOREACH b GENERATE COUNT(a.$0);
  }}}
  
- This is equivalent to `SELECT COUNT(*)` in SQL.
  
- '''11. Does Pig allow grouping on expressions?'''
+ '''Q: Does Pig allow grouping on expressions?'''
  
- Currently, Pig only allows grouping on data fields rather than expressions. 
Allowing grouping on expressions is on our roadmap. Stay tuned!
+ Pig allows grouping on expressions. For example:
  
- '''12. Is there a way to check if a map is empty?'''
+ {{{
+ grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
+ grunt> DUMP a;
+ (1,2,3)
+ (4,2,1)
+ (4,3,4)
+ (4,3,4)
+ (7,2,5)
+ (8,4,3)
  
- Currently, there is no way to do that.
+ grunt> b = GROUP a BY (x+y);
+ grunt> DUMP b;
+ (3.0,{(1,2,3)})
+ (6.0,{(4,2,1)})
+ (7.0,{(4,3,4),(4,3,4)})
+ (9.0,{(7,2,5)})
+ (12.0,{(8,4,3)})
+ }}}
  
+ If the grouping is based on constants, the result is the same as GROUP ALL 
except the group-id is replaced by the constant.
+ {{{
+ grunt> b = GROUP a BY 4;
+ grunt> DUMP b;
+ (4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
+ }}}
+ '''Q: Is there a way to check if a map is empty?'''
+ 
+ In Pig 2.0 you can test the existence of values in a map using the null 
construct: 
+ m#'key' is not null
+ 
- '''13. How can I specify the number of nodes Pig allocates?'''
+ '''Q: How can I specify the number of nodes Pig allocates?'''
  
  {{{
  > pig -Dhod.param='-m 3' my_script.pig
@@ -90, +113 @@

  
  Three (3) nodes is the minimum.
  
- '''14. How can I load data using `PigStorage()` that requires Unicode 
specification for separators?'''
+ '''Q: How can I ask Pig to use an already allocated HOD cluster?''' 
  
- Old version of Pig using `'\t'`:
+ Suppose you allocated a cluster:
+ {{{
+ $ mkdir -p ~/hod-clusters/test
+ $ hod allocate -d ~/hod-clusters/test -n 5
+ $ setenv CLUSTERDIR ~/hod-clusters/test
+ }}}
+  
+ You can then use the following command, using either -Dhod.server="" or 
-Dhod.server='':
+ {{{
+ $ pig -cp $CLUSTERDIR -Dhod.server='' myscript.pig 
+ }}}
+  
  
- {{{
- a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');
- }}}
- 
- New version of Pig using Unicode:
- 
- {{{
- a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0000B');
- }}}
-
