[Pig Wiki] Update of "PigUserCookbook" by OlgaN

Apache Wiki Thu, 15 Jan 2009 11:33:21 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  
  In some of our tests we saw a 10x performance improvement as a result of this 
optimization.
  
+ '''Use Fragment Replicate Join'''
+ 
+ This type of join works well if one of the tables is small enough to fit into 
main memory. In this case, Pig can perform a very efficient join because, in 
Hadoop, it can be done entirely on the map side.
+ 
+ {{{
+ tiny = load 'small_file' as (t, u, v);
+ large = load 'large_file' as (x, y, z);
+ C = join large by x, tiny by t using "replicated";
+ }}}
+ 
+ Note that the large table must come first, followed by one or more small 
tables. Together, all of the small tables must fit into main memory.
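+ 
+ As a sketch of the case with more than one small table, the join might look 
like the following. The second small input, 'small_file2', and its fields 
(a, b, c) are hypothetical and used only for illustration.
+ 
+ {{{
+ -- 'small_file2' and its fields (a, b, c) are hypothetical
+ tiny = load 'small_file' as (t, u, v);
+ mini = load 'small_file2' as (a, b, c);
+ large = load 'large_file' as (x, y, z);
+ -- the large table is listed first, then every small table
+ C = join large by x, tiny by t, mini by a using "replicated";
+ }}}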
+ 
+ This feature is new and experimental. It is experimental because we don't 
have a strong sense of how small the small table must be to fit into memory. In 
our tests with a simple query that involved only a join, a table of up to 100M 
could be used when the process had 1 GB of memory overall. If the table does 
not fit into memory, the process fails with an error.
+ 
  '''Prefer DISTINCT over GROUP BY - GENERATE'''
  
  When it comes to extracting the unique values from a column in a relation, 
one of two approaches can be used:
