Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 

The following page has been changed by OlgaN:

  In some of our tests we saw 10x performance improvement as the result of this 
+ '''Use Fragment Replicate Join'''
+ This type of join works well if one of the tables is small enough to fit into 
main memory. In this case, Pig can perform a very efficient join because, in 
the Hadoop world, it can be done completely on the map side.
+ {{{
+ tiny = load 'small_file' as (t, u, v);
+ large = load 'large_file' as (x, y, z);
+ C = join large by x, tiny by t using "replicated";
+ }}}
+ Note that the large table must come first, followed by one or more small 
tables. All small tables together must fit into main memory.
+ This feature is new and experimental. It is experimental because we don't 
yet have a strong sense of how small the small table must be to fit into memory. 
In our tests with a simple query that involved only a join, a table of up to 
100M could be used when the process overall got 1 GB of memory. If the table 
does not fit into memory, the process fails and generates an error.
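To make the "completely on the map side" idea concrete, here is a minimal conceptual sketch in Python. It is not Pig's actual implementation. The function name `replicated_join` and the sample rows are hypothetical, and the sketch assumes the small input fits in memory as an ordinary hash table while the large input is streamed one row at a time.

```python
from collections import defaultdict

def replicated_join(large_rows, small_rows, large_key, small_key):
    """Join two row sets by hashing the small one and streaming the large one.

    large_key and small_key are the positions of the join column in each row.
    """
    # Build phase: load the whole small table into an in-memory index.
    # This is the part that must fit into main memory.
    index = defaultdict(list)
    for row in small_rows:
        index[row[small_key]].append(row)

    # Probe phase: stream the large table and look up each row's key.
    # No sorting or shuffling of the large table is needed.
    for row in large_rows:
        for match in index.get(row[large_key], ()):
            yield row + match

# Hypothetical sample data: large has columns (x, y, z), tiny has (t, u, v).
large = [(1, "a", "b"), (2, "c", "d"), (3, "e", "f")]
tiny = [(1, "u1", "v1"), (3, "u3", "v3")]

joined = list(replicated_join(large, tiny, large_key=0, small_key=0))
for record in joined:
    print(record)
```

Because only the small side is held in memory, each mapper can in principle join its own fragment of the large table independently, which is why no reduce phase is required.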
  '''Prefer DISTINCT over GROUP BY - GENERATE'''
  When it comes to extracting the unique values from a column in a relation, 
one of two approaches can be used:
