[
https://issues.apache.org/jira/browse/PIG-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593104#action_12593104
]
Olga Natkovich commented on PIG-223:
------------------------------------
I think as Pig becomes more mature, we will start collecting and storing needed
metadata such as data sizes, column cardinality, sort/partition order of the
data, etc.
Trying to dynamically compute the information if it is not available sounds
like a good idea.
> Optimization Idea: Dynamic histogram generation for join ordering?
> ------------------------------------------------------------------
>
> Key: PIG-223
> URL: https://issues.apache.org/jira/browse/PIG-223
> Project: Pig
> Issue Type: Improvement
> Reporter: Pi Song
>
> This idea sprang into my mind when I was implementing explicit casting
> insertion for Type Checking.
> Problem:
> Given a query containing 3 or more joins, what is the most efficient join
> order? (Pig doesn't have an indexing feature, so statistics are not available.)
> Solution:
> 0. Start with a given plan
> 1. Somehow select the first join (this is still an open question).
> 2. Insert a histogram generator for the columns used in the remaining joins
> into the first MapReduce run.
> 3. Run MapReduce
> 4. Use histogram information generated from (2) to order joins for the rest
> of the plan
> 5. Run further MapReduce jobs until the plan finishes.
> There is another open question regarding histograms for joins based on
> calculated columns. In this case, calculating the histogram upfront might
> conflict with the conventional optimization technique "pulling filters up
> and pushing calculations down".
> I'm not sure how useful this is, because I have never come across a query
> with 3 joins myself.
> Any opinion?
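
Steps 2 and 4 of the quoted proposal can be illustrated with a small in-memory
sketch. This is only an assumption about how the idea might look, not Pig code:
the function names are hypothetical, relations are plain lists of dicts, and
the "histogram" is an exact per-value count rather than the bucketed or sampled
histogram a real MapReduce job would probably produce.

```python
from collections import Counter


def build_histogram(rows, column):
    """Count how often each value of `column` occurs in `rows` (a list of dicts)."""
    return Counter(row[column] for row in rows)


def estimate_join_size(left_hist, right_hist):
    """Estimate the equi-join output size as the sum of per-key count products."""
    return sum(
        count * right_hist[key]
        for key, count in left_hist.items()
        if key in right_hist
    )


def order_remaining_joins(first_join_output, remaining):
    """Rank the remaining joins by estimated output size, smallest first.

    `remaining` maps a relation name to (rows, left_column, right_column),
    where left_column belongs to the output of the first join.
    Returns the ordered relation names and the per-relation estimates.
    """
    estimates = {}
    for name, (rows, left_col, right_col) in remaining.items():
        left_hist = build_histogram(first_join_output, left_col)
        right_hist = build_histogram(rows, right_col)
        estimates[name] = estimate_join_size(left_hist, right_hist)
    # Ties are broken by relation name so the ordering is deterministic.
    order = sorted(estimates, key=lambda name: (estimates[name], name))
    return order, estimates
```

With exact counts the estimate equals the true size of a two-way equi-join.
With coarser histograms it would only be an approximation, and this sketch
ranks every remaining join against the first join's output only, without
re-estimating after each subsequent join.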
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.