[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101952#comment-13101952
 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Almost finish the patch. I did a preliminary test based on TPC-H Q17 and Q18. 
My machine has a quad-core Intel Xeon X3220 processor (2.4 GHz), 4GB of RAM, a 
500GB hard disk and Ubuntu 11.04. With scale factor 10, the execution time of 
Q17 is 1216.94s without the patch versus 713.581s with the patch, and that of 
Q18 is 1737.18s without the patch versus 867.334s with the patch. 

I am facing a issue which I have not found a good way to solve. Suppose that we 
have a query "SELECT * FROM (SELECT L.c1 as c11, R.c2 as c12 FROM L JOIN R ON 
L.c1=R.C2) t1 JOIN (SELECT R.c1 as c21, count(distinct R.c2) as c22 FROM R 
GROUP BY R.c1) ON t1.c11=t2.c21". In this query, only one MapReduce job is 
necessary. However, because Hive will use R.c1 and R.c2 as the key columns of 
the original ReduceSinkOperator for the sub-query involving distinct count 
function, it is impossible to merged MapReduce jobs of two sub-queries into 
one. To optimize this kind of query, I write a new UDF function 
count_distinct(...), e.g. count_distinct(R.c2). This count_distinct function 
use a HashSet to get the number of distinct records. Is there any better 
solution for optimizing this kind of queries? Thanks.     

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to