[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101952#comment-13101952 ]
Yin Huai commented on HIVE-2206: -------------------------------- Almost finish the patch. I did a preliminary test based on TPC-H Q17 and Q18. My machine has a quad-core Intel Xeon X3220 processor (2.4 GHz), 4GB of RAM, a 500GB hard disk and Ubuntu 11.04. With scale factor 10, the execution time of Q17 is 1216.94s without the patch versus 713.581s with the patch, and that of Q18 is 1737.18s without the patch versus 867.334s with the patch. I am facing a issue which I have not found a good way to solve. Suppose that we have a query "SELECT * FROM (SELECT L.c1 as c11, R.c2 as c12 FROM L JOIN R ON L.c1=R.C2) t1 JOIN (SELECT R.c1 as c21, count(distinct R.c2) as c22 FROM R GROUP BY R.c1) ON t1.c11=t2.c21". In this query, only one MapReduce job is necessary. However, because Hive will use R.c1 and R.c2 as the key columns of the original ReduceSinkOperator for the sub-query involving distinct count function, it is impossible to merged MapReduce jobs of two sub-queries into one. To optimize this kind of query, I write a new UDF function count_distinct(...), e.g. count_distinct(R.c2). This count_distinct function use a HashSet to get the number of distinct records. Is there any better solution for optimizing this kind of queries? Thanks. > add a new optimizer for query correlation discovery and optimization > -------------------------------------------------------------------- > > Key: HIVE-2206 > URL: https://issues.apache.org/jira/browse/HIVE-2206 > Project: Hive > Issue Type: New Feature > Reporter: He Yongqiang > Assignee: Yin Huai > Attachments: Queries, YSmartPatchForHive.patch > > > reference: > http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira