[Pig Wiki] Update of "PigUserCookbook" by FlipKromer

Apache Wiki Fri, 16 Jan 2009 20:57:58 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by FlipKromer:
http://wiki.apache.org/pig/PigUserCookbook

The comment on the change is:
The COGROUP in a JOIN is implicitly INNER 

------------------------------------------------------------------------------
  
  This comment only applies to pig on the types branch, as pig 0.1.0 does not 
have nulls.
  
+ With the introduction of nulls, join and cogroup semantics were altered to 
work with nulls.  The rule for cogrouping with nulls is that nulls from a 
given input are grouped together, but nulls across inputs are not grouped 
together.  This preserves the semantics of grouping (nulls are collected 
together from a single input to be passed to aggregate functions like COUNT) 
and the semantics of join (nulls are not joined across inputs).  Since 
flattening an empty bag produces no output rows, in a standard join the rows 
with a null key will always be dropped.  The join: 
- With the introduction of nulls, join and cogroup semantics were altered to 
work with nulls.  The semantic for cogrouping with nulls is that nulls from a 
given input are
- grouped together, but nulls across inputs are not grouped together.  This 
preserves the semantics of grouping (nulls are collected together from a single 
input to be
- passed to aggregate functions like COUNT) and the semantics of join (nulls 
are not joined across inputs).  Since flattening an empty bag results in an 
empty row, in a
- standard join the rows with a null key will always be dropped.  The join: 
  
  {{{
  A = load 'myfile' as (t, u, v);
@@ -112, +109 @@

  {{{
  A = load 'myfile' as (t, u, v);
  B = load 'myotherfile' as (x, y, z);
- C1 = cogroup A by t, B by x;
+ C1 = cogroup A by t INNER, B by x INNER;
  C = foreach C1 generate flatten(A), flatten(B);
  }}}
  
+ Since the nulls from A and B won't be collected together, when the nulls are 
flattened we're guaranteed to have an empty bag, which will result in no 
output.  So the null keys will be dropped.  But they will not be dropped until 
the last possible moment.  If the query is rewritten to
- Since the nulls from A and B won't be collected together, when the nulls are 
flattened we're guaranteed to have an empty bag, which will result in no 
output.  So the null
- keys will be dropped.  But they will not be dropped until the last possible 
moment.  If the query is rewritten to
  
  {{{
  A = load 'myfile' as (t, u, v);
@@ -127, +123 @@

  C = join A1 by t, B1 by x;
  }}}
  
+ then the nulls will be dropped before the join.  Since all null keys go to a 
single reducer, if your key is null even a small percentage of the time the 
gain can be significant.  In one test where the key was null 7% of the time and 
the data was spread across 200 reducers, we saw about a 10x speedup in the 
query by adding the early filters.
- then the nulls will be dropped before the join.  Since all null keys go to a 
single reducer, if your key is null even a small percentage of the time the 
gain can be
- significant.  In one test where the key was null 7% of the time and the data 
was spread across 200 reducers, we saw a about a 10x speed up in the query by 
adding the early
- filters.
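  
  As a rough sketch of the early-filter rewrite described above: the relation names A1 and B1 follow the join statement shown, while the two filter statements are an assumption about how those relations are built and are not quoted from the page.
  
  {{{
  -- Load both inputs with the same schemas used in the examples above.
  A = load 'myfile' as (t, u, v);
  B = load 'myotherfile' as (x, y, z);
  -- Assumed step: drop null join keys before the join, so they never
  -- reach the shuffle and cannot pile up on a single reducer.
  A1 = filter A by t is not null;
  B1 = filter B by x is not null;
  -- Join only the rows that can actually match.
  C = join A1 by t, B1 by x;
  }}}
  
  The output should be the same as the unfiltered join, since rows with null keys are dropped either way; only the point at which they are dropped changes.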
  
  '''Take Advantage of Join Optimization'''
