[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054816#comment-13054816
 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

The current optimizer can identify correlations with query plan tree structures 
like TPC-H Q17 (in attached file Queries). Using Q17 as an example, sub-query 
(denoted as sub-Q1 and originally executed by MapReduce job J1) "SELECT 
l_partkey as t_partkey, 0.2 * avg(l_quantity) AS t_avg_quantity FROM lineitem 
GROUP BY l_partkey" has correlation with sub-query (denoted as sub-Q2 and 
originally executed by MapReduce job J2) "SELECT l_quantity, l_partkey, 
l_extendedprice FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey AND 
p.p_brand = 'Brand#52' AND p.p_container = 'JUMBO CAN'", because (1)sub-Q1 and 
sub-Q2 share the same input 'lineitem'; (2) ReduceSinkOperators in J1 and J2 
share the same 'key', which is l_partkey (p_partkey). Also, because 
intermediate tables generated by sub-Q1 and sub-Q2 will be joined by a 
MapReduce job J3, of which the 'key' of ReduceSinkOperator is 'l_partkey', J3 
has correlation with J1 and J2. Thus, J1, J2 and J3 can be merged into one 
MapReduce job J'. In the map function of J', a composite operator will be used 
to execute FilterOperators (if any) for sub-Q1 and sub-Q2. Then, in the reduce 
function of J', a dispatch operator is used to dispatch reduce-input records to 
JoinOperator in J1 and GroupByOperator in J2. Then, the results of JoinOperator 
and GroupByOperator will be fed to the JoinOperator in J3. 

For this optimizer, there are several issues. 

1: Because for the MapReduce job executing correlated MapReduce jobs, 
intermediate key/value pairs will be consumed by multiple operators, Map-side 
Aggregation is disabled. 
2: For the MapReduce job executing correlated MapReduce jobs, if the depth of 
execution path in the reduce function is not the same (for example "SELECT * 
FROM lineitem l1 JOIN (SELECT l_partkey FROM part p JOIN lineitem l ON 
p.p_partkey = l.l_partkey) tmp ON l1.partkey = tmp.partkey"), one or multiple 
YSmartForwardOperator should be used. I have not completely solved this issue.
3: For two independent MapReduce jobs J1 and J2, the current correlation 
identifier only searches ReduceSinkOperators with the same 'key(s)' for 
correlation, actually the set of 'key(s)' of the ReduceSinkOperator in J1 is a 
subset of that in J2, these two MapReduce jobs are correlated. (Also, 
sub-queries with distinct keyword associated with Group By clause is under this 
issue, since distinct keyword is handled by using all columns as 'keys' in its 
corresponding ReduceSinkOperator)
4: The current correlation identifier can not identify correlations represented 
by columns involving "max(<column name>)" or "min(<column name>)".

I will start working on this optimizer in August and will firstly solve issues 
2-4 mentioned above.  

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to