Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by FlipKromer:
http://wiki.apache.org/pig/PigUserCookbook

The comment on the change is:
The COGROUP in a JOIN is implicitly INNER

------------------------------------------------------------------------------
  This comment only applies to pig on the types branch, as pig 0.1.0 does not have nulls.
+ With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a standard join the rows with a null key will always be dropped. The join:
- With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are
- grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be
- passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a
- standard join the rows with a null key will always be dropped. The join:
  {{{
  A = load 'myfile' as (t, u, v);
@@ -112, +109 @@
  {{{
  A = load 'myfile' as (t, u, v);
  B = load 'myotherfile' as (x, y, z);
- C1 = cogroup A by t, B by x;
+ C1 = cogroup A by t INNER, B by x INNER;
  C = foreach C1 generate flatten(A), flatten(B);
  }}}
+ Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. If the query is rewritten to
- Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null
- keys will be dropped. But they will not be dropped until the last possible moment. If the query is rewritten to
  {{{
  A = load 'myfile' as (t, u, v);
@@ -127, +123 @@
  C = join A1 by t, B1 by x;
  }}}
+ then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw a about a 10x speed up in the query by adding the early filters.
- then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be
- significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw a about a 10x speed up in the query by adding the early
- filters.
  '''Take Advantage of Join Optimization'''
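
For readers of this notification, the null-key behaviour described in the change above can be modelled in a few lines of Python. This is only an illustrative sketch, not Pig's implementation: the functions `cogroup`, `flatten_both`, `join`, and `drop_null_keys` and the sample rows are all hypothetical, and `None` stands in for a Pig null. It shows that null keys from one input are grouped together but never with nulls from the other input, that flattening against an empty bag produces no rows, and that filtering null keys before the join leaves the result unchanged.

```python
from collections import defaultdict
from itertools import product


def cogroup(a, b, key_a=0, key_b=0):
    """Toy cogroup: None keys group within one input, never across inputs."""
    groups = defaultdict(lambda: ([], []))
    for row in a:
        k = row[key_a]
        # Tag null keys with their source input so A's nulls and B's nulls stay apart.
        group_key = ("null", "A") if k is None else ("key", k)
        groups[group_key][0].append(row)
    for row in b:
        k = row[key_b]
        group_key = ("null", "B") if k is None else ("key", k)
        groups[group_key][1].append(row)
    return dict(groups)


def flatten_both(groups):
    """Toy flatten(A), flatten(B): cross product of the two bags in each group.

    If either bag is empty the cross product is empty, so the group emits no rows.
    """
    out = []
    for bag_a, bag_b in groups.values():
        out.extend(ra + rb for ra, rb in product(bag_a, bag_b))
    return out


def join(a, b):
    """Toy join: cogroup on the first field, then flatten both bags."""
    return flatten_both(cogroup(a, b))


def drop_null_keys(rows, key=0):
    """Toy early filter: discard rows whose key is None before joining."""
    return [r for r in rows if r[key] is not None]


# Hypothetical sample data; the first field is the join key.
A = [(1, "a1", "x"), (None, "a2", "x"), (2, "a3", "x"), (None, "a4", "x")]
B = [(1, "b1", "y"), (None, "b2", "y"), (3, "b3", "y")]

late = join(A, B)                                    # nulls dropped only at flatten time
early = join(drop_null_keys(A), drop_null_keys(B))   # nulls dropped before the join

print(late)
print(early == late)
```

In this model the early filter changes only how many rows enter the join, not what comes out. Any actual speed-up depends on how the null keys are distributed across reducers, which the sketch does not simulate.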
