[ 
https://issues.apache.org/jira/browse/PIG-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-50.
---------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0

A rudimentary optimizer was added as of 0.3, and work on it is ongoing
(see PIG-1178).

> query optimization for Pig
> --------------------------
>
>                 Key: PIG-50
>                 URL: https://issues.apache.org/jira/browse/PIG-50
>             Project: Pig
>          Issue Type: Wish
>          Components: impl
>            Reporter: Christopher Olston
>             Fix For: 0.3.0
>
>
> add relational query optimization techniques, or similar, to Pig
> discussion so far:
> ** Amir Youssefi:
> Comparing two Pig scripts, join+filter and filter+join, I see that Pig has
> an optimization opportunity: apply the filter constraints first, then do the
> actual join. Do we have a JIRA open for this (or other optimization
> scenarios)?
> In my case, the first one resulted in an OutOfMemory exception but the second
> one ran just fine.
> ** Chris Olston:
> Yup. It would be great to sprinkle a little relational query optimization 
> technology onto Pig.
> Given that query optimization is a double-edged sword, we might want to 
> consider some guidelines of the form:
> 1. Optimizations should always be easy to override by the user. (Sometimes 
> the system is smarter than the user, but other times the reverse is true, and 
> that can be incredibly frustrating.)
> 2. Only "safe" optimizations should be performed, where a safe optimization 
> is one that with 95% probability doesn't make the program slower. (An example 
> is pushing filters before joins, given that the filter is known to be cheap; 
> if the filter has a user-defined function it is not guaranteed to be cheap.) 
> Or perhaps there is a knob that controls worst-case versus expected-case 
> minimization.
> We're at a severe disadvantage relative to relational query engines, because 
> at the moment we have zero metadata. We don't even know the schema of our 
> data sets, much less the distributions of data values (which in turn govern 
> intermediate data sizes between operators). We have to think about how to
> approach this in a way that is compatible with the Pig philosophy of having
> metadata always be optional. It could be as simple as "fine, if the user
> doesn't want to 'register' his data with Pig, then Pig won't be able to
> optimize programs over that data very well", or as sophisticated as on-line
> sampling and/or on-line operator reordering.
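
To make the filter-before-join point from the discussion concrete, here is a minimal sketch in plain Python rather than Pig Latin. It is not Pig's optimizer: the relations (`users`, `clicks`), the helper `hash_join`, the predicate `is_adult`, and the rule `should_push_filter` are all hypothetical names invented for illustration. It only shows that both plans return the same rows while the filter-first plan builds a smaller intermediate result, and how guidelines 1 and 2 might be encoded as a simple decision rule.

```python
# Illustrative sketch only: Python lists of dicts stand in for Pig relations.
# Every name here is hypothetical and not part of Pig.

def hash_join(left, right, key):
    """Inner join of two lists of dicts on `key`; returns merged rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

users = [{"uid": i, "age": 15 + i} for i in range(10)]            # ages 15..24
clicks = [{"uid": i % 10, "url": f"/page/{i}"} for i in range(1000)]

def is_adult(row):
    # Cheap predicate that only reads a column from `users`.
    return row["age"] >= 21

# Plan A: join first, then filter (the full join is materialized).
joined = hash_join(clicks, users, "uid")
plan_a = [r for r in joined if is_adult(r)]

# Plan B: filter first, then join (only matching users reach the join).
adults = [u for u in users if is_adult(u)]
plan_b = hash_join(clicks, adults, "uid")

by_url = lambda r: r["url"]
same_result = sorted(plan_a, key=by_url) == sorted(plan_b, key=by_url)

def should_push_filter(uses_udf, user_override=None):
    """Guideline sketch: an explicit user choice always wins (guideline 1);
    otherwise push only filters without UDFs, assumed cheap (guideline 2)."""
    if user_override is not None:
        return user_override
    return not uses_udf

print(len(joined), len(plan_b), same_result)
```

In this toy data set the join-first plan materializes every joined row before discarding most of them, which is the kind of intermediate blow-up that could plausibly lead to the out-of-memory failure reported above. The pushdown is only valid here because the predicate reads columns from a single join input.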

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
