Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigDeveloperCookbook

The comment on the change is:
Moved performance for pig scripts into PigUserCookbook

------------------------------------------------------------------------------
  To use Pig with the Eclipse IDE, see ["Eclipse Environment"].

- == Performance Enhancers ==
-
- The following is a list of tips that people have discovered for making their Pig queries run faster. Please feel free to add any tips you have.
-
- '''Project Early and Often'''
-
- Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like:
-
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C = join A by t, B by x;
- D = group C by u;
- E = foreach D generate group, COUNT($1);
- }}}
-
- There is no need for v, y, or z to participate in this query. Nor is there any need to carry both t and x past the join; one will suffice. Changing
- the above query to
-
- {{{
- A = load 'myfile' as (t, u, v);
- A1 = foreach A generate t, u;
- B = load 'myotherfile' as (x, y, z);
- B1 = foreach B generate x;
- C = join A1 by t, B1 by x;
- C1 = foreach C generate t, u;
- D = group C1 by u;
- E = foreach D generate group, COUNT($1);
- }}}
-
- will greatly reduce the amount of data that Pig carries through the map and reduce phases. Depending on your data, this can produce significant time savings. In
- queries similar to the example given we have seen total time drop by 50%.
-
- '''Drop Nulls Before a Join'''
-
- This comment only applies to Pig on the types branch, as Pig 0.1.0 does not have nulls.
-
- With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantics of cogrouping with nulls are that nulls from a given input are
- grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be
- passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag produces no output row, in a
- standard join the rows with a null key will always be dropped. The join:
-
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C = join A by t, B by x;
- }}}
-
- is rewritten by Pig to
-
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- C1 = cogroup A by t, B by x;
- C = foreach C1 generate flatten(A), flatten(B);
- }}}
-
- Since the nulls from A and B won't be collected together, each group with a null key has an empty bag for the other input, and flattening that empty bag results in no output. So the null
- keys will be dropped. But they will not be dropped until the last possible moment. If the query is rewritten to
-
- {{{
- A = load 'myfile' as (t, u, v);
- B = load 'myotherfile' as (x, y, z);
- A1 = filter A by t is not null;
- B1 = filter B by x is not null;
- C = join A1 by t, B1 by x;
- }}}
-
- then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be
- significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw a 6x speedup in the query by adding the early
- filters.
-
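
The null-handling rule described in the removed text can be sketched with a small Python model. This is only an illustration under stated assumptions, not Pig's implementation: the `cogroup` and `join` functions, the sample rows, and the `shuffled` counter (a stand-in for the number of rows sent to the reducers) are all hypothetical. It shows that filtering null keys before the join leaves the result unchanged while fewer rows reach the grouping step.

```python
from collections import defaultdict


def cogroup(a_rows, b_rows):
    """Toy model of cogroup on the first field of each row.

    Null (None) keys are grouped together within one input but are
    never grouped across inputs, as the text above describes.
    """
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        key = ("A-null",) if row[0] is None else ("key", row[0])
        groups[key][0].append(row)
    for row in b_rows:
        key = ("B-null",) if row[0] is None else ("key", row[0])
        groups[key][1].append(row)
    return groups


def join(a_rows, b_rows):
    """Model join as cogroup followed by flattening both bags.

    Returns (joined_rows, shuffled), where shuffled counts every row
    that had to be grouped (a stand-in for rows sent to reducers).
    """
    groups = cogroup(a_rows, b_rows)
    shuffled = len(a_rows) + len(b_rows)
    # Flattening an empty bag yields nothing, so a group whose other
    # side is empty (every null-key group) contributes no output rows.
    joined = [a + b for bag_a, bag_b in groups.values()
              for a in bag_a for b in bag_b]
    return sorted(joined), shuffled


# Hypothetical sample data: (key, value) rows, some with null keys.
A = [(1, "a"), (None, "b"), (2, "c"), (None, "d")]
B = [(1, "x"), (None, "y"), (3, "z")]

# Nulls dropped only at flatten time.
late_result, late_shuffled = join(A, B)

# Nulls filtered out before the join.
A1 = [row for row in A if row[0] is not None]
B1 = [row for row in B if row[0] is not None]
early_result, early_shuffled = join(A1, B1)
```

Both variants produce the same joined rows; only the number of rows that pass through the grouping step differs. The model does not capture the reducer skew that the text identifies as the main cost of late filtering.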