Can't you do
a = load ..
b = group a ..
c = foreach b generate sum, count, dozens of other aggregates;
?
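Spelled out with made-up field names, something like:

a = load 'logs' as (user, bytes, latency);
b = group a by user;
c = foreach b generate group, SUM(a.bytes), COUNT(a), AVG(a.latency);
-- ...plus dozens of other aggregates, all computed in the same pass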
Utkarsh
On Dec 12, 2007, at 11:43 PM, Ted Dunning wrote:
The biggest opportunity that Pig has to optimize performance (in my very limited experience) is the ability to do more than one thing in a single pass over the data. This seems to be the major advance represented in the Pig work.

Particularly for aggregation jobs, where the major effort consists of grouping massive input for subsequent statistical abstracting, it would be fantastic to be able to write many queries and then have the system reliably push all of the aggregations onto a single reduce function, so that I pass over the huge input file only once and produce dozens of aggregate stats.
Even for my recommendation systems, the second largest cost is just passing over the original data. Matrix reduction is a major cost, but passing over my logs once instead of three times would help enormously.
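For instance (names invented), three counts like these are three separate jobs, and three scans of the same file, today:

logs = load 'logs' as (user, item, day);
by_user = group logs by user;
user_counts = foreach by_user generate group, COUNT(logs);
by_item = group logs by item;
item_counts = foreach by_item generate group, COUNT(logs);
by_day = group logs by day;
day_counts = foreach by_day generate group, COUNT(logs);
-- stores of each output omitted; ideally one scan of 'logs' would feed all three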
On 12/12/07 8:56 PM, "Chris Olston" <[EMAIL PROTECTED]> wrote:
Yup. It would be great to sprinkle a little relational query optimization technology onto Pig.

Given that query optimization is a double-edged sword, we might want to consider some guidelines of the form:
1. Optimizations should always be easy for the user to override. (Sometimes the system is smarter than the user, but other times the reverse is true, and that can be incredibly frustrating.)

2. Only "safe" optimizations should be performed, where a safe optimization is one that with 95% probability doesn't make the program slower. (An example is pushing filters before joins, when the filter is known to be cheap; if the filter invokes a user-defined function, it is not guaranteed to be cheap.) Or perhaps there is a knob that controls worst-case versus expected-case minimization.
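To make the caveat concrete (IsSpam here is a hypothetical, registered UDF):

pages = load 'pages' as (url, text);
links = load 'links' as (src, dst);
j = join pages by url, links by dst;
good = filter j by NOT IsSpam(text);
-- pushing this filter below the join is only a clear win if IsSpam is cheap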
We're at a severe disadvantage relative to relational query engines, because at the moment we have zero metadata. We don't even know the schema of our data sets, much less the distributions of data values (which in turn govern intermediate data sizes between operators). We have to think about how to approach this in a way that is compatible with the Pig philosophy of having metadata always be optional. It could be as simple as "fine, if the user doesn't want to register his data with Pig, then Pig won't be able to optimize programs over that data very well," or as sophisticated as on-line sampling and/or on-line operator reordering.
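(The optional "as" clause on load is already a small, user-driven step in this direction: writing

a = load 'data' as (user, bytes);

hands Pig a schema it otherwise wouldn't have.)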
-Chris
On Dec 12, 2007, at 7:10 PM, Amir Youssefi wrote:
Comparing two Pig scripts, join+filter versus filter+join, I see that Pig has an optimization opportunity: first apply the filter constraints, then do the actual join. Do we have a JIRA open for this (or for other optimization scenarios)?

In my case, the first one resulted in an OutOfMemory exception but the second one runs just fine.
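Roughly (relation and field names changed), the two variants are:

-- variant 1: filter after join (OutOfMemory for me)
a = load 'big' as (k, v);
b = load 'small' as (k, w);
j = join a by k, b by k;
out = filter j by v > 100;

-- variant 2: filter before join (runs fine)
a = load 'big' as (k, v);
b = load 'small' as (k, w);
a2 = filter a by v > 100;
out = join a2 by k, b by k;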
-Amir
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research