Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by ChrisOlston:

New page:
== Query Optimization Ideas for Pig ==

=== Already Implemented ===

* pipeline a sequence of stateless operators into a single Map or single Reduce
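
As a rough illustration of the pipelining item above (not Pig's actual execution code), the sketch below chains stateless per-record operators into a single pass, the way they could run inside one Map task. The names `filter_op`, `foreach_op` and `pipeline`, and the sample fields, are hypothetical.

```python
# Hypothetical sketch: stateless operators chained lazily into one pass.
# None of these names come from Pig itself.

def filter_op(pred):
    """Stateless FILTER: keep records for which pred(record) is true."""
    def run(records):
        return (r for r in records if pred(r))
    return run

def foreach_op(fn):
    """Stateless FOREACH: transform each record independently."""
    def run(records):
        return (fn(r) for r in records)
    return run

def pipeline(*ops):
    """Compose operators so every record flows through all of them in one pass."""
    def run(records):
        stream = iter(records)
        for op in ops:
            stream = op(stream)  # generators: nothing is materialized between operators
        return stream
    return run

# Three logical operators, executed as one "map phase".
map_phase = pipeline(
    filter_op(lambda r: r["bytes"] > 0),
    foreach_op(lambda r: {"userid": r["userid"], "kb": r["bytes"] / 1024}),
    filter_op(lambda r: r["kb"] >= 1),
)

rows = [
    {"userid": "a", "bytes": 2048},
    {"userid": "b", "bytes": 0},
    {"userid": "c", "bytes": 512},
]
result = list(map_phase(rows))
```

Because each operator only looks at the current record, no intermediate data set has to be written between them.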

=== Implemented, but room for improvement ===

* push algebraic functions into the combiner, including algebraic UDFs, DISTINCT, and other items
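
To make "algebraic" concrete: a function can be pushed into the combiner when it can be computed from partial results that are merged associatively. The sketch below decomposes an average into three steps. The function names are hypothetical and only illustrate the idea; this is not Pig's UDF interface.

```python
from functools import reduce

# Hypothetical three-step decomposition of AVG (names are assumptions).

def avg_initial(value):
    """Map side: turn one value into a partial (sum, count)."""
    return (value, 1)

def avg_combine(a, b):
    """Combiner: merge two partials; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def avg_final(partial):
    """Reduce side: turn the merged partial into the answer."""
    total, count = partial
    return total / count

def run_with_combiner(map_outputs):
    """map_outputs: one list of values per map task, all for the same key."""
    # Each map task ships a single partial instead of all its values.
    combined = [
        reduce(avg_combine, map(avg_initial, chunk))
        for chunk in map_outputs if chunk
    ]
    return avg_final(reduce(avg_combine, combined))

chunks = [[1, 2, 3], [4], [5, 6]]
avg = run_with_combiner(chunks)
flat = [v for chunk in chunks for v in chunk]
```

The result matches the average over the flattened input, but only one (sum, count) pair per map task crosses the network.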

=== Low hanging fruit ===

* System-R optimizer heuristics:
   * push projections (move them earlier in the plan)
   * push cheap filters (move filters known to be cheap, e.g. ones with simple logic predicates, earlier in the plan)
   * eliminate cartesian products when possible, e.g. convert CROSS followed by 
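
A minimal sketch of the "push cheap filters" heuristic, under strong simplifying assumptions: the plan is a linear list of operators, and a cheap filter may move ahead of a FOREACH only if it reads no field that the FOREACH creates or modifies. The `Op` class and its fields are hypothetical and not part of Pig.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """Hypothetical plan node for a linear plan (not a Pig class)."""
    kind: str                        # "load", "foreach", "filter", ...
    cheap: bool = False              # only meaningful for filters
    reads: frozenset = frozenset()   # fields a filter's predicate reads
    writes: frozenset = frozenset()  # fields an operator creates or modifies

def push_cheap_filters(plan):
    """Swap a cheap filter with the preceding FOREACH while that is safe."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            if (cur.kind == "filter" and cur.cheap
                    and prev.kind == "foreach"
                    and not (cur.reads & prev.writes)):
                plan[i - 1], plan[i] = cur, prev  # filter now runs earlier
                changed = True
    return plan

plan = [
    Op("load"),
    Op("foreach", writes=frozenset({"kb"})),
    Op("foreach", writes=frozenset({"domain"})),
    Op("filter", cheap=True, reads=frozenset({"kb"})),
]
optimized = push_cheap_filters(plan)
kinds = [op.kind for op in optimized]
```

Here the filter moves ahead of the FOREACH that computes `domain`, but stays behind the one that computes `kb`, since its predicate depends on that field.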

=== Medium hanging fruit ===

* look for ways to do multiple group/cogroup/join operations in a single map-reduce job; this is possible when the keys share a common prefix. Example: group by userid+hour, then count, then group by userid, then take max. This can be done in one map-reduce job with userid as the reduce key.
* choose a join strategy (symmetric hashing, fragment-and-replicate, ...); a reasonable choice can probably be made based on file sizes [but first, the various join strategies have to be implemented in the execution layer; currently Pig supports only symmetric hashing]
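
The userid+hour example from the first item above can be simulated in a few lines. This is only a sketch of the idea, with an in-memory stand-in for the shuffle and made-up sample data: because userid is a prefix of the inner key (userid, hour), all records needed for both grouping levels arrive at the same reducer.

```python
from collections import Counter, defaultdict

def map_phase(records):
    """Emit userid as the reduce key; hour travels in the value."""
    for userid, hour in records:
        yield userid, hour

def shuffle(pairs):
    """In-memory stand-in for the map-reduce shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(userid, hours):
    """Do both grouping levels inside one reduce call."""
    per_hour = Counter(hours)               # inner: group by (userid, hour), count
    return userid, max(per_hour.values())   # outer: group by userid, take max

# Made-up (userid, hour) records for illustration.
records = [
    ("u1", 9), ("u1", 9), ("u1", 10),
    ("u2", 9), ("u2", 11), ("u2", 11), ("u2", 11),
]
result = dict(
    reduce_phase(userid, hours)
    for userid, hours in shuffle(map_phase(records)).items()
)
```

A naive plan would run one job for the count and a second job for the max; here the second shuffle is avoided entirely.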

=== Probably won't get there, and may not even want to go there ===

* query optimization techniques found in any database textbook
   * reordering filters (need to estimate selectivity based on histograms, or maybe adaptively reorder)
   * ...
