The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  In some of our tests we saw a 10x performance improvement as a result of this optimization.

+ '''Use Fragment Replicate Join'''
+
+ This type of join works well if one of the tables is small enough to fit into main memory. In that case Pig can perform a very efficient join because, in Hadoop, it can be done entirely on the map side.
+
+ {{{
+ tiny = load 'small_file' as (t, u, v);
+ large = load 'large_file' as (x, y, z);
+ C = join large by x, tiny by t using "replicated";
+ }}}
+
+ Note that the large table must come first, followed by one or more small tables. All of the small tables together must fit into main memory.
+
+ This feature is new and experimental. It is experimental because we don't yet have a strong sense of how small the small table must be to fit into memory. In our tests with a simple query that involved only a join, a table of up to 100M could be used when the process was given 1 GB of memory overall. If the table does not fit into memory, the process fails and generates an error.
+
  '''Prefer DISTINCT over GROUP BY - GENERATE'''

  When it comes to extracting the unique values from a column in a relation, one of two approaches can be used:
