Optimization Idea: Dynamic histogram generation for join ordering?
------------------------------------------------------------------
Key: PIG-223
URL: https://issues.apache.org/jira/browse/PIG-223
Project: Pig
Issue Type: Improvement
Reporter: Pi Song
This idea came to mind while I was implementing explicit cast insertion
for type checking.
Problem:
Given a query containing 3 or more joins, what is the most efficient join
order? (Pig has no indexing feature, so no statistics are available.)
Solution:
0. Start with a given plan.
1. Somehow select the first join (this is still an open question).
2. Insert a histogram generator for the columns used in the remaining joins
into the first MapReduce run.
3. Run MapReduce.
4. Use the histogram information generated in (2) to order the joins for the
rest of the plan.
5. Run further MapReduce jobs until the plan finishes.
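As a rough illustration of steps 2 and 4, here is a minimal in-memory sketch in Python. It is not Pig code: the function names (build_histogram, estimate_join_size, order_joins), the sample relations, and the "smallest estimated result first" greedy rule are all assumptions made for the example. In a real plan the histograms would be collected as a side output of the first MapReduce run and would probably be bucketed or sampled rather than exact.

```python
from collections import Counter

def build_histogram(rows, column):
    """Count how often each value of `column` occurs (exact per-value histogram)."""
    return Counter(row[column] for row in rows)

def estimate_join_size(hist_left, hist_right):
    """Estimate equi-join output rows: for each shared key, multiply the counts."""
    return sum(count * hist_right[key]
               for key, count in hist_left.items()
               if key in hist_right)

def order_joins(joins, histograms):
    """Greedy ordering (an assumption): run the join with the smallest estimate first.

    `joins` is a list of (name, left_hist_key, right_hist_key) tuples.
    """
    return sorted(
        joins,
        key=lambda j: estimate_join_size(histograms[j[1]], histograms[j[2]]),
    )

# Hypothetical sample relations.
orders = [{"cust": 1, "item": "a"}, {"cust": 1, "item": "b"}, {"cust": 2, "item": "a"}]
customers = [{"cust": 1}, {"cust": 2}, {"cust": 3}]
items = [{"item": "a"}, {"item": "a"}, {"item": "a"}, {"item": "b"}]

# Step 2: histograms for the columns used in the remaining joins.
histograms = {
    "orders.cust": build_histogram(orders, "cust"),
    "customers.cust": build_histogram(customers, "cust"),
    "orders.item": build_histogram(orders, "item"),
    "items.item": build_histogram(items, "item"),
}

# Step 4: order the remaining joins by estimated output size.
remaining = [
    ("orders-items", "orders.item", "items.item"),
    ("orders-customers", "orders.cust", "customers.cust"),
]
plan = [name for name, _, _ in order_joins(remaining, histograms)]
```

With this sample data the orders-customers join is estimated at 3 rows and the orders-items join at 7, so the sketch schedules orders-customers first.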
There is another open question regarding histograms for joins on calculated
columns. In this case, calculating the histogram up front might conflict with
the conventional optimization technique "pulling filters up and pushing
calculations down".
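To make the conflict concrete, here is a small Python sketch, again with made-up names and data (histogram_of_expression, the price/qty rows, and the bucket expression are assumptions). To build a histogram on a calculated join key, the expression has to be evaluated for every row in the first pass, including rows that a later filter might have discarded before the calculation ever ran.

```python
from collections import Counter

def histogram_of_expression(rows, expr):
    """Histogram of a calculated join key; `expr` is evaluated on every row up front."""
    return Counter(expr(row) for row in rows)

# Hypothetical input rows and a hypothetical calculated join key.
rows = [
    {"price": 12, "qty": 3},
    {"price": 7, "qty": 5},
    {"price": 20, "qty": 2},
]

def bucket(row):
    # The calculation that would normally be deferred to later in the plan.
    return (row["price"] * row["qty"]) // 10

hist = histogram_of_expression(rows, bucket)
```

Here the calculation runs three times in the first pass regardless of what any later filter keeps, which is the extra cost the histogram has to pay for itself.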
I'm not sure how useful this would be, because I have never come across a
query with 3 joins myself. Any opinions?