Yup. It would be great to sprinkle a little relational query
optimization technology onto Pig.
Given that query optimization is a double-edged sword, we might want
to consider some guidelines of the form:
1. Optimizations should always be easy to override by the user.
(Sometimes the system is smarter than the user, but other times the
reverse is true, and that can be incredibly frustrating.)
2. Only "safe" optimizations should be performed, where a safe
optimization is one that with 95% probability doesn't make the
program slower. (An example is pushing filters before joins, given
that the filter is known to be cheap; if the filter has a user-
defined function it is not guaranteed to be cheap.) Or perhaps there
is a knob that controls worst-case versus expected-case minimization.
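
For concreteness, here is a minimal Pig Latin sketch of the
filter-pushdown case (the relation and field names are made up for
illustration, not taken from Amir's scripts):

  -- Plan 1: join first, then filter. The join materializes every
  -- matching pair of tuples before the filter discards most of them.
  A = LOAD 'visits' AS (user, url, time);
  B = LOAD 'pages' AS (url, pagerank);
  C = JOIN A BY url, B BY url;
  D = FILTER C BY pagerank > 0.5;

  -- Plan 2: filter first, then join. One join input shrinks before
  -- any tuples get paired up, so intermediate data stays small.
  B2 = FILTER B BY pagerank > 0.5;
  D2 = JOIN A BY url, B2 BY url;

A safe rewrite would turn Plan 1 into Plan 2 when the predicate is a
cheap comparison like this one, and leave the plan alone when the
predicate invokes a UDF of unknown cost.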
We're at a severe disadvantage relative to relational query engines,
because at the moment we have zero metadata. We don't even know the
schema of our data sets, much less the distributions of data values
(which in turn govern intermediate data sizes between operators). We
have to think about how to approach this in a way that is compatible
with the Pig philosophy of having metadata always be optional. It
could be as simple as "fine, if the user doesn't want to register his
data with Pig, then Pig won't be able to optimize programs over that
data very well," or as sophisticated as on-line sampling and/or
on-line operator reordering (a rough sketch of the sampling idea is
below).
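
To make the sampling idea concrete, here is one way Pig could
estimate a filter's selectivity before deciding whether a pushdown is
safe. (This assumes a SAMPLE operator and uses a made-up is_spam UDF;
it's a shape-of-the-idea sketch, not a proposal for the actual
mechanism.)

  -- Draw a 1% sample of the input, then measure how many sampled
  -- tuples survive the candidate filter; the surviving fraction
  -- approximates the filter's selectivity on the full data set.
  raw = LOAD 'visits' AS (user, url, time);
  S   = SAMPLE raw 0.01;
  SF  = FILTER S BY is_spam(url);      -- is_spam: hypothetical UDF
  G   = GROUP SF ALL;
  N   = FOREACH G GENERATE COUNT(SF);  -- surviving-tuple count

If only a small fraction of the sample survives, pushing the filter
below the join pays off even for a moderately expensive predicate,
because the join's inputs shrink dramatically.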
-Chris
On Dec 12, 2007, at 7:10 PM, Amir Youssefi wrote:
Comparing two Pig scripts, join+filter versus filter+join, I see
that Pig has an optimization opportunity: first apply the filter
constraints, then do the actual join. Do we have a JIRA open for this
(or for other optimization scenarios)?
In my case, the first one resulted in an OutOfMemory exception but
the second one ran just fine.
-Amir
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research