Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
This feature is new and experimental. It is experimental because we don't yet have a strong sense of how small the small table must be to fit into memory. In our tests with a simple query that involved just a join, a table of up to 100M could be used if the process overall got 1 GB of memory. If the table does not fit into memory, the process fails and generates an error.

+ '''Use PARALLEL Keyword'''
+
+ PARALLEL controls the number of reducers invoked by Hadoop. The default out of the box is 1, which in most cases is not what you want. A reasonable heuristic is something like
+ {{{
+ <num machines> * <num reduce slots per machine> * 0.9
+ }}}
+
+ The keyword makes sense on any operator that starts a reduce phase. This includes GROUP, COGROUP, JOIN, DISTINCT, LIMIT, and ORDER BY.
+
+ Example:
+
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = group A by t PARALLEL 18;
+ .....
+ }}}
+
+ '''Use LIMIT'''
+
+ Often you are not interested in the entire output but only in a sample or the top results. In those cases, using LIMIT can yield much better performance, because we push the limit as high as possible to minimize the amount of data travelling through the pipeline.
+
+ Sample:
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = limit A 500;
+ }}}
+
+ Top results:
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = order A by t;
+ C = limit B 500;
+ }}}
+
'''Prefer DISTINCT over GROUP BY - GENERATE'''

When it comes to extracting the unique values from a column in a relation, one of two approaches can be used:
