[Pig Wiki] Update of "PigUserCookbook" by OlgaN

Apache Wiki Tue, 27 Jan 2009 17:14:56 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  
  This feature is new and experimental. It is experimental because we don't 
have a strong sense of how small the small table must be to fit into memory. In 
our tests with a simple query that involved just a join, a table of up to 100M 
could be used if the process overall got 1 GB of memory. If the table does not 
fit into memory, the process fails and generates an error.
  
+ '''Use PARALLEL Keyword'''
+ 
+ PARALLEL controls the number of reducers invoked by Hadoop. The default out 
of the box is 1, which in most cases is not what you want. A reasonable 
heuristic is something like 
+ {{{
+ <num machines> * <num reduce slots per machine> * 0.9
+ }}}
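
As a worked illustration of the heuristic above, here is a minimal sketch in Python (not Pig Latin). The function name `suggested_reducers` and the cluster size of 10 machines with 2 reduce slots each are hypothetical values chosen for illustration; they do not come from the page. Rounding down and never going below 1 are also assumptions, since the heuristic itself is only approximate.

```python
def suggested_reducers(num_machines: int, reduce_slots_per_machine: int) -> int:
    """Estimate a PARALLEL value as machines * slots * 0.9, rounded down.

    Hypothetical helper: integer arithmetic (x * 9 // 10) is used instead of
    multiplying by the float 0.9 so the result does not depend on rounding.
    """
    total_slots = num_machines * reduce_slots_per_machine
    # Keep roughly 10% of the reduce slots free; always use at least 1 reducer.
    return max(1, total_slots * 9 // 10)


if __name__ == "__main__":
    # Assumed cluster: 10 machines, 2 reduce slots per machine.
    print(suggested_reducers(10, 2))
```

For this assumed cluster the heuristic gives 10 * 2 * 0.9 = 18, which coincides with the value used in the PARALLEL example on this page.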
+ 
+ The keyword makes sense on any operator that starts a reduce phase. This 
includes GROUP, COGROUP, JOIN, DISTINCT, LIMIT, and ORDER BY.
+ 
+ Example:
+ 
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = group A by t PARALLEL 18;
+ .....
+ }}}
+ 
+ '''Use LIMIT'''
+ 
+ Often you are not interested in the entire output but in either a sample or 
the top results. In those cases, using LIMIT can yield much better performance 
because we push the limit as high as possible to minimize the amount of data 
travelling through the pipeline.
+ 
+ Sample:
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = limit A 500;
+ }}}
+ 
+ Top results:
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = order A by t;
+ C = limit B 500;
+ }}}
+ 
  '''Prefer DISTINCT over GROUP BY - GENERATE'''
  
  When it comes to extracting the unique values from a column in a relation, 
one of two approaches can be used:
