Apache Wiki
Wed, 02 Apr 2008 13:59:48 -0700
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigOptimizationWishList

New page:

== Query Optimization Ideas for Pig ==

=== Already Implemented ===

 * pipeline a sequence of stateless operators into a single Map or single Reduce

=== Implemented, but room for improvement ===

 * push algebraic functions into combiner, including algebraic UDFs, DISTINCT, and other items

=== Low hanging fruit ===

 * System-R optimizer heuristics:
   * push projections (move them earlier in the plan)
   * push cheap filters (move filters known to be cheap, e.g. ones with simple logic predicates, earlier in the plan)
   * eliminate cartesian products when possible, e.g. convert CROSS followed by FILTER into JOIN

=== Medium hanging fruit ===

 * look for ways to do multiple group/cogroup/join operations in a single map-reduce job --- this would occur if the keys share a common prefix. Example: group by userid+hour, then count, then group by userid, then take max --- can be done in one map-reduce job with userid as the reduce key.
 * choose a join strategy (symmetric hashing, fragment-and-replicate, ...); can probably make a reasonable choice based on file sizes [but first, we have to implement various join strategies in the execution layer -- currently pig only supports symmetric hashing]

=== Probably won't get there, and may not even want to go there ===

 * query optimization techniques found in any database textbook
 * reordering filters (need to estimate selectivity based on histograms, or maybe adaptively reorder)
 * ...
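
As an illustration of the common-prefix example under "Medium hanging fruit" (group by userid+hour, count, group by userid, take max), here is a minimal Python sketch of how both groupings could fit into one map-reduce job keyed on userid. It is not Pig code and does not reflect how Pig's execution layer works; the function names (map_phase, shuffle, reduce_phase, max_hourly_count) and the sample events are hypothetical, and the shuffle is simulated in memory.

```python
from collections import defaultdict

# Hypothetical input: one (userid, hour) pair per logged event.
events = [
    ("alice", 9), ("alice", 9), ("alice", 10),
    ("bob", 9), ("bob", 14), ("bob", 14), ("bob", 14),
]


def map_phase(records):
    """Emit userid, the common prefix of both grouping keys, as the reduce key."""
    for userid, hour in records:
        yield userid, hour


def shuffle(pairs):
    """Group mapper output by reduce key (stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(userid, hours):
    """Perform both groupings inside a single reduce call.

    Inner grouping: count events per (userid, hour).
    Outer grouping: take the max of those counts for the userid.
    """
    counts = defaultdict(int)
    for hour in hours:
        counts[hour] += 1
    return userid, max(counts.values())


def max_hourly_count(records):
    """Run the simulated single job: map, shuffle on userid, reduce."""
    grouped = shuffle(map_phase(records))
    return dict(reduce_phase(user, hours) for user, hours in grouped.items())


print(max_hourly_count(events))
```

Because every (userid, hour) group for a given user arrives at the same reducer, the per-hour counts never need to be written out and regrouped by a second job. This sketch keeps one counter per distinct hour in memory for each user; a real job might instead sort values by hour within each key and stream through them.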