[Pig Wiki] Update of "PigDeveloperCookbook" by AlanGates

Apache Wiki Tue, 21 Oct 2008 15:57:00 -0700

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigDeveloperCookbook

The comment on the change is:
Moved performance for pig scripts into PigUserCookbook

------------------------------------------------------------------------------
  
  To use Pig with the Eclipse IDE, see ["Eclipse Environment"].
  
- == Performance Enhancers ==
- 
- The following is a list of tips that people have discovered for making
- their Pig queries run faster.  Please feel free to add any tips you have.
- 
- '''Project Early and Often'''
- 
- Pig does not (yet) determine when a field is no longer needed and drop it
- from the row.  For example, say you have a query like:
- 
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C = join A by t, B by x;
- D = group C by u;
- E = foreach D generate group, COUNT($1);
- }}}
- 
- There is no need for v, y, or z to participate in this query, and there is
- no need to carry both t and x past the join; one will suffice.  Changing
- the above query to
- 
- {{{
- A = load 'myfile' as (t, u, v);
- A1 = foreach A generate t, u;
- B = load 'myotherfile' as (x, y, z);
- B1 = foreach B generate x;
- C = join A1 by t, B1 by x;
- C1 = foreach C generate t, u;
- D = group C1 by u;
- E = foreach D generate group, COUNT($1);
- }}} 
- 
- will greatly reduce the amount of data Pig carries through the map and
- reduce phases.  Depending on your data, this can produce significant time
- savings.  In queries similar to the example given, we have seen total time
- drop by 50%.
- 
- '''Drop Nulls Before a Join'''
- 
- This tip applies only to Pig on the types branch, as Pig 0.1.0 does not
- have nulls.
- 
- With the introduction of nulls, join and cogroup semantics were altered to
- work with nulls.  The rule for cogrouping with nulls is that nulls from a
- given input are grouped together, but nulls across inputs are not grouped
- together.  This preserves the semantics of grouping (nulls are collected
- together from a single input to be passed to aggregate functions like COUNT)
- and the semantics of join (nulls are not joined across inputs).  Since
- flattening an empty bag produces no output rows, in a standard join the rows
- with a null key will always be dropped.  The join:
- 
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C = join A by t, B by x;
- }}}
- 
- is rewritten by Pig to
- 
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C1 = cogroup A by t, B by x;
- C = foreach C1 generate flatten(A), flatten(B);
- }}}
- 
- Since the nulls from A and B won't be collected together, each group with a
- null key has an empty bag for the other input, so flattening it produces no
- output.  The null keys will therefore be dropped, but not until the last
- possible moment.  If the query is rewritten to
- 
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- A1 = filter A by t is not null;
- B1 = filter B by x is not null;
- C = join A1 by t, B1 by x;
- }}}
- 
- then the nulls will be dropped before the join.  Since all null keys go to a
- single reducer, if your key is null even a small percentage of the time the
- gain can be significant.  In one test where the key was null 7% of the time
- and the data was spread across 200 reducers, we saw a 6x speedup in the
- query by adding the early filters.
-
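
The null handling described in the removed "Drop Nulls Before a Join" tip can be modelled with a small Python sketch. It is not Pig code and not how Pig is implemented: `None` stands in for a Pig null, rows are plain tuples keyed on their first field, and the names `cogroup`, `join`, `drop_nulls`, and `null_reducer_share` are hypothetical helpers introduced only for illustration. The skew estimate additionally assumes that non-null keys spread evenly across reducers.

```python
from itertools import product


def cogroup(a, b):
    """Group rows of two inputs on their first field.

    Null (None) keys are tagged with the input they came from, so nulls
    group together within one input but never across inputs.
    """
    groups = {}
    for side, rows in enumerate((a, b)):
        for row in rows:
            key = row[0]
            group_id = ("null", side) if key is None else ("key", key)
            groups.setdefault(group_id, ([], []))[side].append(row)
    return groups


def join(a, b):
    """Model a join as a cogroup followed by flattening both bags."""
    out = []
    for bag_a, bag_b in cogroup(a, b).values():
        # Flattening two bags yields their cross product; if either bag
        # is empty the product is empty, so the group emits no rows.
        out.extend(row_a + row_b for row_a, row_b in product(bag_a, bag_b))
    return out


def drop_nulls(rows):
    """Model of 'filter ... by key is not null'."""
    return [row for row in rows if row[0] is not None]


def null_reducer_share(null_fraction, reducers):
    """Share of all rows on the reducer that receives the null keys,
    assuming (hypothetically) an even spread of the non-null keys."""
    return null_fraction + (1.0 - null_fraction) / reducers


A = [(1, "a"), (None, "b"), (2, "c")]
B = [(1, "x"), (None, "y"), (3, "z")]

joined = join(A, B)
joined_after_filter = join(drop_nulls(A), drop_nulls(B))
```

In this model `joined` and `joined_after_filter` contain the same rows, which is why the early filter is safe: it only removes rows that the join would discard anyway. With the figures from the test quoted above, `null_reducer_share(0.07, 200)` puts about 7.5% of all rows on one reducer against roughly 0.5% on each of the others. This is only a rough picture of the skew and makes no attempt to reproduce the measured 6x figure.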
