[ https://issues.apache.org/jira/browse/PIG-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594044#action_12594044 ]
Pi Song commented on PIG-223:
-----------------------------
That's right. Collecting metadata will help a lot. There are two cases:
1. User-directed metadata creation. This is like creating indexes in an RDBMS.
2. Dynamic metadata creation. This may happen as part of optimization when
the user runs an ad hoc query.
> Optimization Idea: Dynamic histogram generation for join ordering?
> ------------------------------------------------------------------
>
> Key: PIG-223
> URL: https://issues.apache.org/jira/browse/PIG-223
> Project: Pig
> Issue Type: Improvement
> Reporter: Pi Song
>
> This idea sprang into my mind when I was implementing explicit casting
> insertion for Type Checking.
> Problem:
> Given a query containing 3 or more joins, what is the most efficient join
> order? (Pig doesn't have an indexing feature, so statistics are not available)
> Solution:
> 0. Start with a given plan
> 1. Somehow select the first join (this is still an open question).
> 2. Insert a histogram generator for the columns used in the remaining joins
> into the first MapReduce run.
> 3. Run MapReduce
> 4. Use the histogram information generated in (2) to order the joins for the
> rest of the plan
> 5. Run more MapReduce jobs until the plan finishes.
> There is another open question regarding histograms for joins based on
> calculated columns. In this case, calculating the histogram up front might
> conflict with the conventional optimization technique "pulling filters up
> and pushing calculations down".
> I'm not sure how useful this is, because I have never come across a query
> with 3 joins myself.
> Any opinion?
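
Steps 2 and 4 of the quoted description can be sketched roughly as below. This is only an illustration, not Pig code: the function names (build_histogram, estimate_join_size, order_joins) and the sample data are hypothetical, it uses exact per-value counts instead of bucketed histograms, and it assumes a histogram is also available for the other side of each remaining join.

```python
from collections import Counter


def build_histogram(rows, column):
    """Count how often each value of `column` occurs (exact frequencies)."""
    return Counter(row[column] for row in rows)


def estimate_join_size(hist_left, hist_right):
    """Estimate equi-join output rows: sum of count products over shared keys."""
    smaller, larger = sorted((hist_left, hist_right), key=len)
    return sum(n * larger[key] for key, n in smaller.items() if key in larger)


def order_joins(hist_by_column, candidates):
    """Sort remaining joins by estimated output size, smallest first.

    candidates: list of (name, join_column, histogram_of_other_relation).
    """
    return sorted(
        candidates,
        key=lambda c: estimate_join_size(hist_by_column[c[1]], c[2]),
    )


# Hypothetical output of the first join, plus two relations still to be joined.
first_join_output = [
    {"user": "a", "page": "p1"},
    {"user": "a", "page": "p2"},
    {"user": "b", "page": "p1"},
]
users = [{"user": "a"}, {"user": "a"}, {"user": "a"}, {"user": "b"}]
pages = [{"page": "p1"}, {"page": "p3"}]

# Step 2: histograms for the columns used in the remaining joins.
hists = {col: build_histogram(first_join_output, col) for col in ("user", "page")}
candidates = [
    ("join_users", "user", build_histogram(users, "user")),
    ("join_pages", "page", build_histogram(pages, "page")),
]

# Step 4: order the remaining joins using the histogram estimates.
plan = [name for name, _, _ in order_joins(hists, candidates)]
print(plan)
```

Here the join on "page" is estimated at 2 output rows and the join on "user" at 7, so the smaller join is scheduled first. Choosing the smallest intermediate result first is one simple greedy heuristic; the description above does not commit to a particular ordering rule.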
--
This message is automatically generated by JIRA.