Optimization Idea: Dynamic histogram generation for join ordering?
------------------------------------------------------------------

                 Key: PIG-223
                 URL: https://issues.apache.org/jira/browse/PIG-223
             Project: Pig
          Issue Type: Improvement
            Reporter: Pi Song


This idea occurred to me while I was implementing explicit cast insertion
for type checking.

Problem:

Given a query containing 3 or more joins, what is the most efficient join
order? (Pig doesn't have an indexing feature, so statistics are not available.)

Solution:

0. Start with a given plan.
1. Somehow select the first join (this is still an open question).
2. Insert a histogram generator for the columns used in the remaining joins
into the first MapReduce run.
3. Run MapReduce.
4. Use the histogram information generated in (2) to order the joins for the
rest of the plan.
5. Run further MapReduce jobs until the plan is finished.
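
As a rough illustration of steps 2 and 4, here is a minimal Python sketch (not
Pig code). The function names build_histogram, estimate_join_size and
order_remaining_joins are hypothetical, and the sketch assumes that exact
per-value histograms are available for both sides of every remaining join and
that the remaining joins are ranked once, smallest estimated output first. A
real implementation would more likely use bucketed or sampled histograms.

```python
from collections import Counter


def build_histogram(rows, column):
    """Count how often each value of `column` occurs in `rows` (step 2)."""
    return Counter(row[column] for row in rows)


def estimate_join_size(hist_left, hist_right):
    """Estimate equi-join output rows: for each key present on both sides,
    multiply the two counts, then sum over all shared keys."""
    return sum(
        count * hist_right[key]
        for key, count in hist_left.items()
        if key in hist_right
    )


def order_remaining_joins(intermediate_hists, candidate_hists):
    """Rank the remaining joins by estimated output size (step 4).

    intermediate_hists: {join_name: histogram of the join column as seen in
                         the output of the first join}
    candidate_hists:    {join_name: histogram of the same column in the
                         relation that is still to be joined}
    """
    estimates = {
        name: estimate_join_size(intermediate_hists[name], candidate_hists[name])
        for name in candidate_hists
    }
    return sorted(estimates, key=estimates.get)


# Toy data (made up for illustration): output of the first join, plus two
# relations that still have to be joined in.
first_join_output = [
    {"user": 1, "item": "a"},
    {"user": 1, "item": "b"},
    {"user": 2, "item": "a"},
]
users = [{"user": 1}, {"user": 2}]
items = [{"item": "a"} for _ in range(5)]

intermediate = {
    "join_users": build_histogram(first_join_output, "user"),
    "join_items": build_histogram(first_join_output, "item"),
}
candidates = {
    "join_users": build_histogram(users, "user"),
    "join_items": build_histogram(items, "item"),
}
join_order = order_remaining_joins(intermediate, candidates)
```

With this toy data the join on "user" is estimated at 3 output rows and the
join on "item" at 10, so the "user" join would be scheduled first.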

Another open question concerns histograms for joins on calculated columns.
In this case the calculated values would have to be available in the first
run, so computing the histogram up front might conflict with the conventional
optimization technique of "pulling filters up and pushing calculations down".

I'm not sure how useful this would be, because I have never come across a
query with 3 joins myself.

Any opinions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
