[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2014-07-15 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061776#comment-14061776
 ] 

Lefty Leverenz commented on HIVE-2206:
--

*hive.optimize.correlation* is documented here:

* [Configuration Properties -- hive.optimize.correlation | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.correlation]

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.21.patch, HIVE-2206.D11097.22.patch, 
 HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, 
 HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, 
 HIVE-2206.D11097.9.patch, HIVE-2206.patch, YSmartPatchForHive.patch, 
 testQueries.2.q


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2014-07-14 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060415#comment-14060415
 ] 

Lefty Leverenz commented on HIVE-2206:
--

The correlation optimizer is documented here:

* [Correlation Optimizer | 
https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer]

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.21.patch, HIVE-2206.D11097.22.patch, 
 HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, 
 HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, 
 HIVE-2206.D11097.9.patch, HIVE-2206.patch, YSmartPatchForHive.patch, 
 testQueries.2.q


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2014-07-13 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060290#comment-14060290
 ] 

Lefty Leverenz commented on HIVE-2206:
--

This added *hive.optimize.correlation* in HiveConf.java with a description in 
hive-default.xml.template, so the parameter needs to be documented in the wiki 
(Configuration Properties).

Note that HIVE-7362 proposes to change the default for 
*hive.optimize.correlation* to true.

General documentation for the correlation optimizer is covered by HIVE-5130.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
  Labels: TODOC12
 Fix For: 0.12.0

 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.21.patch, HIVE-2206.D11097.22.patch, 
 HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, 
 HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, 
 HIVE-2206.D11097.9.patch, HIVE-2206.patch, YSmartPatchForHive.patch, 
 testQueries.2.q


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc



--
This message was sent by Atlassian JIRA

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-09-19 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772336#comment-13772336
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has closed the revision HIVE-2206 [jira] add a new optimizer for query 
correlation discovery and optimization.

  Closed by commit rHIVE1504395 (authored by hashutosh).

CHANGED PRIOR TO COMMIT
  https://reviews.facebook.net/D11097?vs=39099id=40161#toc

REVISION DETAIL
  https://reviews.facebook.net/D11097

COMMIT
  https://reviews.facebook.net/rHIVE1504395

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.21.patch, HIVE-2206.D11097.22.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-08-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13751563#comment-13751563
 ] 

Yin Huai commented on HIVE-2206:


the last patch was for HIVE-5149...

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.21.patch, HIVE-2206.D11097.2.patch, 
 HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, 
 HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, 
 HIVE-2206.D11097.9.patch, HIVE-2206.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-31 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13726019#comment-13726019
 ] 

Yin Huai commented on HIVE-2206:


thanks [~sershe]. I will make the change

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-31 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13726023#comment-13726023
 ] 

Yin Huai commented on HIVE-2206:


i opened https://issues.apache.org/jira/browse/HIVE-4972 to update code 
generated by thrift

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-19 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713831#comment-13713831
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Please ignore the latest diff it is for HIVE-4877...

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-4877

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.20.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712046#comment-13712046
 ] 

Edward Capriolo commented on HIVE-2206:
---

Ok this patch has been +1 multiple times. Was even +1 a year ago. Can we commit 
now? I would hate to have to see Yin rebase again.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712164#comment-13712164
 ] 

Gunther Hagleitner commented on HIVE-2206:
--

[~yhuai] Thanks for sticking with this. This is awesome!

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712239#comment-13712239
 ] 

Hudson commented on HIVE-2206:
--

FAILURE: Integrated in Hive-trunk-h0.21 #2205 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2205/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and 
optimization
(Yin Huai via Ashutosh Chauhan)

Summary:
update test results

This issue proposes a new logical optimizer called Correlation Optimizer, which 
is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The 
idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and 
slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence by sentence fashion, for every 
operation which may need to shuffle the data (e.g. join and aggregation 
operations), Hive will generate a MapReduce job for that operation. However, 
for those operations which may need to shuffle the data, they may involve 
correlations explained below and thus can be executed in a single MR job.

Input Correlation: Multiple MR jobs have input correlation (IC) if 
their input relation sets are not disjoint;
Transit Correlation: Multiple MR jobs have transit correlation (TC) if 
they have not only input correlation, but also the same partition key;
Job Flow Correlation: An MR has job flow correlation (JFC) with one of 
its child nodes if it has the same partition key as that child node.

The current implementation of correlation optimizer only detect correlations 
among MR jobs for reduce-side join operators and reduce-side aggregation 
operators (not map only aggregation). A query will be optimized if it satisfies 
following conditions.

There exists a MR job for reduce-side join operator or reduce side 
aggregation operator which have JFC with all of its parents MR jobs (TCs will 
be also exploited if JFC exists);
All input tables of those correlated MR job are original input tables 
(not intermediate tables generated by sub-queries); and
No self join is involved in those correlated MR jobs.

Correlation optimizer is implemented as a logical optimizer. The main reasons 
are that it only needs to manipulate the query plan tree and it can leverage 
the existing component on generating MR jobs.

Current implementation can serve as a framework for correlation related 
optimizations. I think that it is better than adding individual optimizers.

There are several work that can be done in future to improve this optimizer. 
Here are three examples.

Support queries only involve TC;
Support queries in which input tables of correlated MR jobs involves 
intermediate tables; and
Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: 
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712295#comment-13712295
 ] 

Brock Noland commented on HIVE-2206:


Nice work Yin! This was one hell of an effort!

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712371#comment-13712371
 ] 

Yin Huai commented on HIVE-2206:


Thank you, guys!

Thanks Ashutosh and Gunther for your help! This patch has been improved a lot. 
I really appreciate your comments and I really enjoy our discussions:)

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712375#comment-13712375
 ] 

Edward Capriolo commented on HIVE-2206:
---

Pat yourself on the back and take a well earned rest!   

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712452#comment-13712452
 ] 

Yin Huai commented on HIVE-2206:


Thanks Edward:)

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712513#comment-13712513
 ] 

Prasanth J commented on HIVE-2206:
--

Finally (after 2 yrs) it's in! Great job [~yhuai]

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712597#comment-13712597
 ] 

Hudson commented on HIVE-2206:
--

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #90 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/90/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and 
optimization
(Yin Huai via Ashutosh Chauhan)

Summary:
update test results

This issue proposes a new logical optimizer called Correlation Optimizer, which 
is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The 
idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and 
slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence by sentence fashion, for every 
operation which may need to shuffle the data (e.g. join and aggregation 
operations), Hive will generate a MapReduce job for that operation. However, 
for those operations which may need to shuffle the data, they may involve 
correlations explained below and thus can be executed in a single MR job.

Input Correlation: Multiple MR jobs have input correlation (IC) if 
their input relation sets are not disjoint;
Transit Correlation: Multiple MR jobs have transit correlation (TC) if 
they have not only input correlation, but also the same partition key;
Job Flow Correlation: An MR has job flow correlation (JFC) with one of 
its child nodes if it has the same partition key as that child node.

The current implementation of correlation optimizer only detect correlations 
among MR jobs for reduce-side join operators and reduce-side aggregation 
operators (not map only aggregation). A query will be optimized if it satisfies 
following conditions.

There exists a MR job for reduce-side join operator or reduce side 
aggregation operator which have JFC with all of its parents MR jobs (TCs will 
be also exploited if JFC exists);
All input tables of those correlated MR job are original input tables 
(not intermediate tables generated by sub-queries); and
No self join is involved in those correlated MR jobs.

Correlation optimizer is implemented as a logical optimizer. The main reasons 
are that it only needs to manipulate the query plan tree and it can leverage 
the existing component on generating MR jobs.

Current implementation can serve as a framework for correlation related 
optimizations. I think that it is better than adding individual optimizers.

There are several work that can be done in future to improve this optimizer. 
Here are three examples.

Support queries only involve TC;
Support queries in which input tables of correlated MR jobs involves 
intermediate tables; and
Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: 
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712670#comment-13712670
 ] 

Hudson commented on HIVE-2206:
--

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #19 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/19/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and 
optimization
(Yin Huai via Ashutosh Chauhan)

Summary:
update test results

This issue proposes a new logical optimizer called Correlation Optimizer, which 
is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The 
idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and 
slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence by sentence fashion, for every 
operation which may need to shuffle the data (e.g. join and aggregation 
operations), Hive will generate a MapReduce job for that operation. However, 
for those operations which may need to shuffle the data, they may involve 
correlations explained below and thus can be executed in a single MR job.

Input Correlation: Multiple MR jobs have input correlation (IC) if 
their input relation sets are not disjoint;
Transit Correlation: Multiple MR jobs have transit correlation (TC) if 
they have not only input correlation, but also the same partition key;
Job Flow Correlation: An MR has job flow correlation (JFC) with one of 
its child nodes if it has the same partition key as that child node.

The current implementation of correlation optimizer only detect correlations 
among MR jobs for reduce-side join operators and reduce-side aggregation 
operators (not map only aggregation). A query will be optimized if it satisfies 
following conditions.

There exists a MR job for reduce-side join operator or reduce side 
aggregation operator which have JFC with all of its parents MR jobs (TCs will 
be also exploited if JFC exists);
All input tables of those correlated MR job are original input tables 
(not intermediate tables generated by sub-queries); and
No self join is involved in those correlated MR jobs.

Correlation optimizer is implemented as a logical optimizer. The main reasons 
are that it only needs to manipulate the query plan tree and it can leverage 
the existing component on generating MR jobs.

Current implementation can serve as a framework for correlation related 
optimizations. I think that it is better than adding individual optimizers.

There are several work that can be done in future to improve this optimizer. 
Here are three examples.

Support queries only involve TC;
Support queries in which input tables of correlated MR jobs involves 
intermediate tables; and
Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: 
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711305#comment-13711305
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Have addressed some comments. Will address the rest of comments later.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 These OIs 
are not needed. I have removed them.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 
JoinOperators relies on the tag to function correctly.  I will add comment to 
explain why we need revert the newTag to oldTag.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Done.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 Done.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Yes, I have 
changed it to numParents = getNumParent();
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Done. Since 
there is another check in initializeOp, I will throw the exception at there.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 Done

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711556#comment-13711556
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Another check point. Will finish soon and generate new patch

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250
 this function is not needed. I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590
 it is not used. I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630
 I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647
 I have deleted it. We can extend the scope of this optimizer in a follow-up 
jira.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83
 Done

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711617#comment-13711617
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171
 table should not be null at here. I will throw an exception when we have 
table==null
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153
 Since CommonJoinTaskDispatcher is in the phase of physical optimization, seems 
that we cannot refactor this part of code in an easy way. I suggest refactoring 
it in a follow-up jira.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711766#comment-13711766
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has accepted the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  +1 Awesome work, Yin!
  Beautiful ascii art too : ) Finally some great comments in code. : )

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130716

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712037#comment-13712037
 ] 

Hive QA commented on HIVE-2206:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12592900/HIVE-2206.patch

{color:green}SUCCESS:{color} +1 all tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/71/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/71/console

Messages:
Executing org.apache.hive.ptest.execution.CleanupPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase

This message is automatically generated.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710219#comment-13710219
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has requested changes to the revision HIVE-2206 [jira] add a new 
optimizer for query correlation discovery and optimization.

  Minor comments, mostly around improving documentation in code.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Does 
this patch makes this necessary? Or, you added it just for completeness?
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Better to 
do it as ListObject thisRow = (ListObject) row;
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 Will be 
good to add comments for all these maps. What mappings they are tracking?
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 Will be 
good to add some ascii art showing an example of such a plan.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Is this 
necessary?
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 I understand 
this but it will be confusing for someone reading this comment for first time 
because before this patch RS operator is always in map side. We need to reword 
this so its easier to read.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Can you add a 
comment when this boolean will be true and when it will be false.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Lets throw an 
exception here. if (childOperatorsArray.length != 1) throw new HiveException 
(Expected number of children is 1. Found :  + childOperatorsArray.length)
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 This should not 
be required. You can always get all the values of enum by using valueOf() 
method on enum.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 It will 
be good to add javadoc for this explaining why we should leave it as it is?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45
 It will be good to add javadoc for this class.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Shane Pratt (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710227#comment-13710227
 ] 

Shane Pratt commented on HIVE-2206:
---

Thank you for your message.

I will be traveling the next several days so there may be a delay in my 
response to your email.

If you need to reach me now, please call the number below.  Otherwise, I will 
respond to you as soon as I can.


Shane
512-590-3925




 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710608#comment-13710608
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has commented on the revision HIVE-2206 [jira] add a new optimizer 
for query correlation discovery and optimization.

  Few more comments. See which of these apply. If they doesn't apply, feel free 
to ignore.

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250
 What does this function do?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171
 Will be good to add comment stating when table == null
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153
 It seems like lot of logic here is shared with CommonJoinTaskDispatcher. It 
will be good to have that refactored so that its reusable here.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284
 Seems like this method always return true. So, this is not required.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271
 Add a comment saying that tree walking is done and now you will apply 
transformations which you have detected.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590
 Do we really need hasBeenRemoved() check?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597
 getKeyCols().size() is not a good check. I will recommend to test explictly 
for operators which we are supporting right now.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630
 Do we still need this?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647
 We should do jobFlowCorrelation as another pass in transform().
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526
 It will be good to add some ascii art which shows what tree structure we are 
returning from this function.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368
 It will good to add javadoc for this method.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453
 Didn't understand this comment. Probably we can word it better.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 I dont think 
we need to revert to oldTag here. We can keep using newTag.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 Doesnt 
look like you are using these OIs. Probably we can get rid of these.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 It will be 
good to add comments for whats the intent of this for loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Why is it 
called NumOriginalParents? can it be just numOfParents
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 There is 
already a forward in Demux, this should not be needed.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 You dont need 
this constructor
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83
 Looks like this map is not used anymore, lets get rid of this.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79
 It will be good to add comments about what this method is intending to do
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67
 This method straight away calls another method. We can eliminate this one.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710687#comment-13710687
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Add an explanation on startGroup. Will start to address the rest of comments 
tomorrow.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Since 
we can have a operator tree with multiple JoinOperators and GroupByOperators 
inside, we need to propagate the startGroup to all operators in the operator 
tree. For queries which are not optimized by this patch, we can have at most 1 
JoinOperator (at the beginning of the reduce-side) and 2 GroupByOperators (1 at 
the beginning of the reduce-side one and 1 hash mode one just before 
FileSinkOperator). This change will not affect those operators.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Please 
see my reply to the same change made in CommonJoinOperator
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 Seems an enum 
does not have a method to return a list of values with the type of string.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-02 Thread Shane Pratt (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698115#comment-13698115
 ] 

Shane Pratt commented on HIVE-2206:
---

Thank you for your message.

I am out of the office on vacation for the remainder of the week.  If this is 
an emergency, please call me at the number below.

Otherwise, I'll respond to your message when I return.

Shane
512-590-3925




 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-07 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678116#comment-13678116
 ] 

Phabricator commented on HIVE-2206:
---

brock has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 Looks like it's 
there because ArrayList defines a clone() method.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 I agree 
that hive does this often. I don't mean to suggest you should fix this in all 
of hive in the patch but let's not add any additional printStackTraces. I see 
one additional new printStackTrace in your patch. Would you mind removing that 
one as well?

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13677381#comment-13677381
 ] 

Yin Huai commented on HIVE-2206:


update the diff at https://reviews.facebook.net/D11097. Fixed two bugs. All 
unit test pass when the optimizer is turned off by default. I am evaluating if 
there is any issue when the optimizer is turned on by default.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1360#comment-1360
 ] 

Phabricator commented on HIVE-2206:
---

brock has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  I was just casually reading this patch and noted a few items.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 If we 
are throwing the exception do we need to print the exception? Also, this should 
be logged not printed.
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 We should be 
returning list of collection not arraylist no? There are a few other 
occurrences of this.

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13677791#comment-13677791
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 Did not 
notice it before. I copied the code of closeOp. I do not think we need to print 
the exception. I will change this class. Also, seems printing the exception 
also appear in lots of other places. If we want to need to remove all of them, 
we need a separate jira.
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 MuxOperator is 
used to replace ReduceSinkOperators in an MR job optimized by this optimizer. I 
basically follow the code of ReduceSinkDesc. Seems the reason that ArrayList is 
used is for clone. I will leave ArrayList at here right now.

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676441#comment-13676441
 ] 

Ashutosh Chauhan commented on HIVE-2206:


In your testcases, some of the patterns you have (e.g., like Join followed by 
GBY) on same keys, I assume reducesink reduplication optimization will already 
take care of it such that it will generate only 1 MR job. Is that correct? Is 
it that for all of your testcases reducesink dedup optimization will not fire. 
If its former, than it will be good to identify which of those cases are 
already taken care by RS dedup. If its latter, than it will be good to know why 
reducesink dedup optimization is not kicking in for those.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676462#comment-13676462
 ] 

Yin Huai commented on HIVE-2206:


RS dedup is on by default. So the explain without CorrelationOptimizer should 
be optimized by RS dedup. But, seems that it does not fire in any of my cases. 
Will take a look at it later.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676643#comment-13676643
 ] 

Yin Huai commented on HIVE-2206:


Just found I need to set false for both hive.auto.convert.join and 
hive.auto.convert.join.noconditionaltask to let RS dedup to work on cases with 
join. I just tried two cases. It works on 
{code:sql}
SELECT x.key AS key, count(1) AS cnt FROM src1 x JOIN src y ON (x.key = y.key) 
GROUP BY x.key
{\code}, and it does work on
{code}
SELECT xx.key, xx.cnt, yy.key, yy.cnt
FROM
(SELECT x.a as key, count(*) AS cnt FROM src x group by x.a) xx
JOIN
(SELECT y.a as key, count(*) AS cnt FROM src1 y group by y.a) yy
ON (xx.key=yy.key);
{\code}

I suggest that we let CorrelationOptimizer to handle cases involving join 
because it supports more cases and has included needed mechanisms.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13675569#comment-13675569
 ] 

Yin Huai commented on HIVE-2206:


HIVE-2206.D11097.1.patch is the latest patch for the trunk. I have heavily 
refactored my code. Here are major changes.
# If multiple operation paths share the same input table, I just use a single 
TableScanOperator and add the bottom operators of these paths as children of 
this common TableScanOperator. I do not do any deduplication of common columns 
because deduplication will significantly make the code more complicated and may 
introduce more problems. If we want to do deduplication, I suggest to tackle it 
later in a followup work.
# Without deduplicating columns, the dispatcher at the reduce side has less 
work to do and some queries involving self join can be optimized in the current 
version.
# The fake ReduceSinkOperator (CorrelationLocalSimulativeReduceSinkOperator... 
I will change the name later) does not do serialization and deserialization as 
appearing in the previous one.
# New test cases are added.
# I also refactor the code ReduceSinkDeDupplication since CorrelationOptimizer 
can reuse some methods introduced by ReduceSinkDeDupplication. [~navis] can you 
take a look at it and see if my changes make sense?

I will run all unit tests soon and will also add more comments.

btw, there is a issue in correlationoptimizer2.q. Optimized plans cannot 
generate rows that both join keys (from the left table and right table) are 
null values for outer joins. I am looking at it

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-02-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575844#comment-13575844
 ] 

Yin Huai commented on HIVE-2206:


[~ashutoshc] have you got a time to look at the patch?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-18 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557411#comment-13557411
 ] 

Ashutosh Chauhan commented on HIVE-2206:


I am having second thoughts on cloning. Cloning graphs (like query plan) or 
dense structures (like ParseContext) is fraught with perils. Its likely that 
cloning will require new code and arguably have hard to detect bugs, since we 
need to track down every single pointer and clone all the way through. I think 
to avoid such issues and for simplicity, we can drop the cloning idea. The 
feature is anyway behind the config option which is off default, so query-plan 
will be modified only for the users who turn the flag on. 
Yin, if you have addressed my other comments, can you update the patch on RB 
and upload here on jira, I will take another look at it.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-18 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557416#comment-13557416
 ] 

Ashutosh Chauhan commented on HIVE-2206:


Oh.. I see that you have already updated RB and jira. I will take a look at it 
soon.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555155#comment-13555155
 ] 

Yin Huai commented on HIVE-2206:


So, if a Map join is involved in a plan and the output of this join will be 
consumed by another subsequent operator, we should leave the Map join in the 
Map phase of the subsequent operator instead of cutting Map join to a separate 
MR job. Is my understanding correct? If so, I think that we should change the 
process on generating Map join Operators. After generating a Map join operator, 
we do not insert a FileSinkOperator after the Map Join. Seems that Rule 11 in 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genMapRedTasks is the rule for 
generating a separate MR job for a Map Join.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-16 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555491#comment-13555491
 ] 

Ashutosh Chauhan commented on HIVE-2206:


I ran my tests with HIVE-3784 and I got a single MR job for my query (i.e., 
mapjoin followed by group-by on different keys) gets you a single MR job. Thats 
cool.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-15 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13554712#comment-13554712
 ] 

Ashutosh Chauhan commented on HIVE-2206:


I did some testing of this on our use-cases. Lets say you have two tables:

{code}
create table t1 (a int, b int);
create table t2 (c int, d int);
select a from ( select * from t1 join t2 on (t1.a = t2.c) )e group by a;
select a from ( select/*+ MAPJOIN(t2) */ * from t1 join t2 on (t1.a = t2.c) )e 
group by a;
{code}

Now, ysmart is able to optimize first query fine, where it fuses 2MR jobs in 1 
MR, since join and group-by has same key. 
However this doesn't work with 2nd query which has mapjoin. This results in 2 
MR job. This is especially important since in map-join case you don't need the 
condition of join-key being same as groupby key, which is *very* important. In 
our use-cases, we have observed its rarely the case that join and group-by is 
on same key. But, in most cases we are able to utilize map-join, since data 
which we are joining on is small enough. And than subsequent group-by which is 
on different key can happen on the reduce side of this single MR job.
Any thoughts on how this could be achieved ? 
Though, it looks like for this use-case we don't need ysmart since we don't 
need to detect any correlation. We can walk on query plan and can always fuse 
map-side join followed by group-by (i.e., no need to detect any correlation 
among inputs or to detect whether join and group-by keys are same).

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-13 Thread Liu Zongquan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552281#comment-13552281
 ] 

Liu Zongquan commented on HIVE-2206:


[~yhuai] I have a question that why not release a patch upon a stable hive 
release, e.g,branch hive-0.8-r2. Actually I found that the r1410581 is not a 
stable revision and even I can't run through ant test -Dtestcase=TestCliDriver 
-Dqfile=show_functions.q -Doverwrite=true on this revision. So, if this patch 
is based on a stable version, espectially a stable branch, then your honor work 
will benefit more people. Even so ,just a suggestion. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-13 Thread David Inbar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552282#comment-13552282
 ] 

David Inbar commented on HIVE-2206:
---

I will be on vacation through January 14th, but will be checking email and 
voicemail periodically.

For all time-critical items, please call my mobile phone.

Many thanks,
David

NOTICE: All information in and attached to this email may be proprietary, 
confidential, privileged and otherwise protected from improper or erroneous 
disclosure. If you are not the sender's intended recipient, you are not 
authorized to intercept, read, print, retain, copy, forward, or disseminate 
this message.



 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552290#comment-13552290
 ] 

Ashutosh Chauhan commented on HIVE-2206:


If Yin wants to provide a patch against a stable (or any) branch, thats his 
choice. But, for patch to get committed it needs to get committed on trunk 
first.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547982#comment-13547982
 ] 

Hudson commented on HIVE-2206:
--

Integrated in Hive-trunk-hadoop2 #54 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/54/])
HIVE-2206:add a new optimizer for query correlation discovery and 
optimization (Yin Huai via He Yongqiang) (Revision 1392105)

 Result = ABORTED
heyongqiang : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1392105
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Liu Zongquan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548809#comment-13548809
 ] 

Liu Zongquan commented on HIVE-2206:


[~yhuai] Thanks so much!

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549298#comment-13549298
 ] 

Ashutosh Chauhan commented on HIVE-2206:


This patch looks useful, especially since once we have this in, it will open up 
other optimization possibilities. I have left some comment on 
https://reviews.apache.org/r/7126/ 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549301#comment-13549301
 ] 

Ashutosh Chauhan commented on HIVE-2206:


Also, can this work enable or facilitate implementation of optimization which 
is getting discussed on HIVE-3773 ?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-07 Thread Liu Zongquan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545984#comment-13545984
 ] 

Liu Zongquan commented on HIVE-2206:


If I plan to merge HIVE-2206 into the hive source code, which branch should I 
use? Can someone tell me?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546150#comment-13546150
 ] 

Yin Huai commented on HIVE-2206:


[~liuzongquan] The latest patch was developed based on hive trunk revision 
1410581.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13505495#comment-13505495
 ] 

Yin Huai commented on HIVE-2206:


[~cwsteinbach] I am not sure if unit tests in Hive are comprehensive enough. If 
not, it might be better that we turn on this optimizer by default in future 
after we can use more queries to test it.

I just tested all unit tests with an enabled correlation optimizer. Because, if 
map side aggregation is on, correlation optimizer also requires regular reduce 
side aggregation to be generated, if cube or rollup is used in the query, 
error message 10209 
(org.apache.hadoop.hive.ql.ErrorMsg.HIVE_GROUPING_SETS_AGGR_NOMAPAGGR) will be 
thrown. Seems HIVE-3508 can solve this issue. Except this issue, a few query 
plans need to be re-generated because of changing operator ids.

This jira has taken a long time. Can we wrap it up and I will start to work on 
follow-up jiras.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500469#comment-13500469
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yin: The correlation optimizer is only enabled for a small set of new 
CliDriver tests. If I enable the correlation optimizer by default, which of the 
existing CliDriver tests are expected to fail?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread David Inbar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500474#comment-13500474
 ] 

David Inbar commented on HIVE-2206:
---

I will be on vacation through Friday Nov 23rd, but will be checking email and 
voicemail periodically.

For all time-critical items, please call my mobile phone.

Many thanks,
David

NOTICE: All information in and attached to this email may be proprietary, 
confidential, privileged and otherwise protected from improper or erroneous 
disclosure. If you are not the sender's intended recipient, you are not 
authorized to intercept, read, print, retain, copy, forward, or disseminate 
this message.



 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500499#comment-13500499
 ] 

Yin Huai commented on HIVE-2206:


[~cwsteinbach]
If the optimizer is enabled by default, based on my last tests, only 
auto_join26.q is expected to fail, because it will be optimized by correlation 
optimizer. But, except the query plan, the query result of auto_join26.q is 
correct. Also, once I finished HIVE-3671 (I am working on it right now), the 
failure of auto_join26.q should be eliminated.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500626#comment-13500626
 ] 

Carl Steinbach commented on HIVE-2206:
--

I'm surprised that auto_join26 is the only test that fails due to different 
EXPLAIN output. Is that because this optimization doesn't affect the queries in 
most tests, or because we don't consistently call EXPLAIN in the tests?

What is preventing us from enabling this by default right now?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13499858#comment-13499858
 ] 

Yin Huai commented on HIVE-2206:


[~namit]
Sure. I just took a look at the code. Seems that once I get all content 
summaries of input table, I can make the guess on if join auto resolver will 
work for join operators on input tables. Because, as far as I know, existing 
util functions on retrieving content summaries (called after logical 
optimization) cannot be used directly at here, I need to write some util 
functions to get sizes of input tables. I will start to work on this asap. 
Also, although HIVE-3671 seems not hard to do, but it is not a quick fix. I 
suggest we track this work in a separate jira.

[~cwsteinbach]
Have you got time to look at current patch? Any comment?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-14 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497756#comment-13497756
 ] 

Namit Jain commented on HIVE-2206:
--

It would be a good idea to get HIVE-3671 in this patch.
With HIVE-3671, the functionality will be much more useful to the whole 
community.
[~yhuai], can you investigate getting HIVE-3671 as part of this patch, and see 
how much
work is it ? Based on that, we can proceed.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496399#comment-13496399
 ] 

He Yongqiang commented on HIVE-2206:


+1, i will commit after tests pass.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496521#comment-13496521
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Can you please hold off on committing while I take another look? 
Thanks.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496528#comment-13496528
 ] 

He Yongqiang commented on HIVE-2206:


@carl, you can go ahead comment, huai will address them in a sperate diff. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496529#comment-13496529
 ] 

He Yongqiang commented on HIVE-2206:


@Carl, keep in mind that you already months of time to comment. So maybe 
addressing your comments in new jiras will make more sense.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496658#comment-13496658
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Please hold off on committing this for a day. Thanks.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496697#comment-13496697
 ] 

He Yongqiang commented on HIVE-2206:


okay, i will target commit it this weekend or earlier next week.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496701#comment-13496701
 ] 

Carl Steinbach commented on HIVE-2206:
--

Thanks!

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13490681#comment-13490681
 ] 

Yin Huai commented on HIVE-2206:


[~namit]
Sure. I created the umbrella jira (HIVE-3667) for all work related to 
correlation optimizer and also created several follow-up jiras as sub-tasks. 
You can also add other sub-tasks into that jira.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-04 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13490432#comment-13490432
 ] 

Namit Jain commented on HIVE-2206:
--

[~yhuai], can you file follow-up jiras for the cases that dont work with this 
optimization ?
It would be good to link them along with this jira. Adding them in the wiki 
would be useful too for tracking.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence by sentence fashion, for every 
 operation which may need to shuffle the data (e.g. join and aggregation 
 operations), Hive will generate a MapReduce job for that operation. However, 
 for those operations which may need to shuffle the data, they may involve 
 correlations explained below and thus can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its 
 child nodes if it has the same partition key as that child node.
 The current implementation of correlation optimizer only detect correlations 
 among MR jobs for reduce-side join operators and reduce-side aggregation 
 operators (not map only aggregation). A query will be optimized if it 
 satisfies following conditions.
 # There exists a MR job for reduce-side join operator or reduce side 
 aggregation operator which have JFC with all of its parents MR jobs (TCs will 
 be also exploited if JFC exists);
 # All input tables of those correlated MR job are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 Correlation optimizer is implemented as a logical optimizer. The main reasons 
 are that it only needs to manipulate the query plan tree and it can leverage 
 the existing component on generating MR jobs.
 Current implementation can serve as a framework for correlation related 
 optimizations. I think that it is better than adding individual optimizers. 
 There are several work that can be done in future to improve this optimizer. 
 Here are three examples.
 # Support queries only involve TC;
 # Support queries in which input tables of correlated MR jobs involves 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-20 Thread alex gemini (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13480878#comment-13480878
 ] 

alex gemini commented on HIVE-2206:
---

Did this jira have a short version description? I know a join followed by group 
is optimized like pipeline, what else we may want to add to wiki?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13480881#comment-13480881
 ] 

Yin Huai commented on HIVE-2206:


[~gemini5201314]
I do not have a short version description right now. Let me write one and 
create a wiki page.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-01 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466937#comment-13466937
 ] 

He Yongqiang commented on HIVE-2206:


I will be on vacation this whole week. Given this is a very big diff, I will 
keep this open for another one week or two for more comments. 


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-01 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13467067#comment-13467067
 ] 

Yin Huai commented on HIVE-2206:


I just found I can remove the first phase of this optimizer. Apparently there 
were changes in the trunk, so I do not need to save original ColumnExprMap and 
OpParseCtx. I have removed unnecessary code and are running tests. Will update 
the patch later.


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466552#comment-13466552
 ] 

He Yongqiang commented on HIVE-2206:


All tests passed for me.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466576#comment-13466576
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: I don't see a +1 vote in this JIRA. According to the project bylaws 
(https://cwiki.apache.org/confluence/display/Hive/Bylaws) this patch should not 
have been committed. Please back this patch out. Thanks.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466580#comment-13466580
 ] 

He Yongqiang commented on HIVE-2206:


I commented that all tests passed.

ok, +1.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466582#comment-13466582
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Sorry, but that's not the way it works. You vote +1 first, wait 24 
hours, and then commit the patch. This is all covered in the project bylaws. 
Please revert this patch. Thanks.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466581#comment-13466581
 ] 

He Yongqiang commented on HIVE-2206:


@Carl, btw, i did mentioned a few times on the comments that i am planing to 
commit this one.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466584#comment-13466584
 ] 

He Yongqiang commented on HIVE-2206:


I did not see a 24 hours waiting on the bylaw page?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466625#comment-13466625
 ] 

He Yongqiang commented on HIVE-2206:


@Carl, i just reverted. I will commit again tomorrow.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466627#comment-13466627
 ] 

Namit Jain commented on HIVE-2206:
--

Sorry for jumping in late on this. This is a pretty big feature - can you give 
me sometime to review this as well ?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466631#comment-13466631
 ] 

Carl Steinbach commented on HIVE-2206:
--

bq. I did not see a 24 hours waiting on the bylaw page?

This is specified in the minimum length column in the table that appears in 
the Actions section of the bylaws document. We could definitely make this 
easier to undertand, but all of the other committers already follow the 
convention that you +1 a patch before committing it, and allow some time to 
elapse in between those two actions in order to give other people a chance to 
weigh in.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13466638#comment-13466638
 ] 

Hudson commented on HIVE-2206:
--

Integrated in Hive-trunk-h0.21 #1711 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/1711/])
HIVE-2206:add a new optimizer for query correlation discovery and 
optimization (Yin Huai via He Yongqiang) (Revision 1392105)

 Result = FAILURE
heyongqiang : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1392105
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463855#comment-13463855
 ] 

Yin Huai commented on HIVE-2206:


I corrected my local configurations related to HBase and checked out HIVE-3507, 
now all tests pass. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13462865#comment-13462865
 ] 

Yin Huai commented on HIVE-2206:


updated patch at reviewboard.

@Carl: Pleas also see my comments under yours. Thanks.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457114#comment-13457114
 ] 

Yin Huai commented on HIVE-2206:


Carl:
The main reason that Yongqiang and I decided to disable this feature by default 
first is that we have not got a chance to test this optimizer heavily.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-17 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457484#comment-13457484
 ] 

He Yongqiang commented on HIVE-2206:


The current patch looks ok. 
@Carl, please give more specific comments. 

We should agree on that new big features should not be enabled by default. 
That's too risky. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456511#comment-13456511
 ] 

Yin Huai commented on HIVE-2206:


Opened a new review request at https://reviews.apache.org/r/7126/, since I have 
been working on hive-git. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-07-28 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424315#comment-13424315
 ] 

He Yongqiang commented on HIVE-2206:


For the last few months (almost one year), Yin has been actively maintaining 
this patch, and i think it is in a very good state to check into trunk. 

So i will do some final review, and hope to commit it sometime next month. 
Please feel free to jump in to review the patch and put any comments here 
before the commit.

In the last review, I will make sure this patch will not have big changes to 
existing execution path, so it can be simply disabled like other optimizations 
in Hive. And Yin will still be actively maintaining this patch (help fix  bugs 
etc) after the commit. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-05-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276409#comment-13276409
 ] 

Yin Huai commented on HIVE-2206:


@anders,
I will try to make this patch more concise and easier to read than the current 
version. If I have any thought on the optimization framework, I will comment 
under HIVE-3027.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-02-10 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205570#comment-13205570
 ] 

jirapos...@reviews.apache.org commented on HIVE-2206:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/#review4912
---


I've started reviewing this, here's my comments so far.  I'll continue to look 
over it.


trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
https://reviews.apache.org/r/2001/#comment11010

Does this have to default to false, does anything break if it's true?

Similarly, have you tried running the tests with this set to true?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
https://reviews.apache.org/r/2001/#comment10818

It's not clear to me why we need both setRowNumber and processOp.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
https://reviews.apache.org/r/2001/#comment10817

Putting this code in a helper method would be better than having it both 
here and in setRowNumber.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
https://reviews.apache.org/r/2001/#comment10819

Does this commented out code need to be kept?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
https://reviews.apache.org/r/2001/#comment10820

I couldn't find a CorrelationFakeReduceSinkOperator class.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
https://reviews.apache.org/r/2001/#comment10821

Tabs are bad, could you change them to spaces, at least in the new code 
your introducing.



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
https://reviews.apache.org/r/2001/#comment10850

I take it from this line it's a requirement that in order for this 
correlation optimization to be attempted every reduce sink has to be followed 
only by children with a single child.

Could this be relaxed?  Could the optimization simply not be applied if 
there is an operator between two ReduceSinks that has more than one child?

Also, if there is a ReduceSink which is not followed by another ReduceSink, 
but is followed by an operator with more than one child, this prevents the 
optimization from being used, even though it shouldn't have an effect.

Also, regarding checking if the size =1, if the size 1 the next line will 
throw an exception.



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
https://reviews.apache.org/r/2001/#comment10851

findNextChildReduceSinkOperator can return null, do you need to check for 
this?


- Kevin


On 2012-01-29 17:56:48, Yin Huai wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2001/
bq.  ---
bq.  
bq.  (Updated 2012-01-29 17:56:48)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  This optimizer exploits intra-query correlations and merges multiple 
correlated MapReduce jobs into one jobs.
bq.  
bq.  
bq.  This addresses bug HIVE-2206.
bq.  https://issues.apache.org/jira/browse/HIVE-2206
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java 
PRE-CREATION 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
 PRE-CREATION 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
 PRE-CREATION 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
 PRE-CREATION 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 
1237326 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 
1237326 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 
1237326 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 
1237326 
bq.trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 
1237326 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
PRE-CREATION 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
 PRE-CREATION 
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
1237326 
bq.

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-02-10 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205654#comment-13205654
 ] 

jirapos...@reviews.apache.org commented on HIVE-2206:
-



bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   I've started reviewing this, here's my comments so far.  I'll continue 
to look over it.

I will update this patch soon. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 453
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71297#file71297line453
bq.  
bq.   Does this have to default to false, does anything break if it's true?
bq.   
bq.   Similarly, have you tried running the tests with this set to true?

I have not tried running the tests with this set to true. I will do it when I 
find a revision which can pass all unit tests (btw, any suggestion on which 
revision should I use?). In my opinion, since this optimizer is kind of 
complicated and it is still being developed, it will be safer to default it to 
false and let users to decide when to use it than default it to true.


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java,
 line 101
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line101
bq.  
bq.   It's not clear to me why we need both setRowNumber and processOp.

Since a CorrelationCompositeOperator may have multiple parents, I used a buffer 
to store the output of parents of the CorrelationCompositeOperator (shown 
processOp method). The TableScanOperator will trigger the setRowNumber method 
and then CorrelationCompositeOperator will decide the operationPathTags of this 
row based on the contents in the buffer and then forward the row in its buffer 
to its child. So, setRowNumber in here is used to evaluate the 
operationPathTags of the row in the buffer before the 
CorrelationCompositeOperator gets the new row. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java,
 lines 150-177
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line150
bq.  
bq.   Putting this code in a helper method would be better than having it 
both here and in setRowNumber.

I will do it. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java,
 line 274
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71300#file71300line274
bq.  
bq.   Does this commented out code need to be kept?

This commented out code is not needed. I will delete it. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1337
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71303#file71303line1337
bq.  
bq.   I couldn't find a CorrelationFakeReduceSinkOperator class.

CorrelationLocalSimulativeReduceSinkOperator was named as 
CorrelationFakeReduceSinkOperator. I will use the right name in the comment. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java, line 
273
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71305#file71305line273
bq.  
bq.   Tabs are bad, could you change them to spaces, at least in the new 
code your introducing.

I will change the format of my code. Thanks for letting me know.


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java,
 line 239
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line239
bq.  
bq.   I take it from this line it's a requirement that in order for this 
correlation optimization to be attempted every reduce sink has to be followed 
only by children with a single child.
bq.   
bq.   Could this be relaxed?  Could the optimization simply not be applied 
if there is an operator between two ReduceSinks that has more than one child?
bq.   
bq.   Also, if there is a ReduceSink which is not followed by another 
ReduceSink, but is followed by an operator with more than one child, this 
prevents the optimization from being used, even though it shouldn't have an 
effect.
bq.   
bq.   Also, regarding checking if the size =1, if the size 1 the next 
line will throw an exception.

Only assert op.getChildOperators().size()  0; is needed at here. Thank you 
for letting me know. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.   
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java,
 line 335
bq.   https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line335
bq.  
bq.   findNextChildReduceSinkOperator can return null, do 

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-01-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195805#comment-13195805
 ] 

jirapos...@reviews.apache.org commented on HIVE-2206:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
---

(Updated 2012-01-29 17:56:48.704757)


Review request for hive.


Changes
---

make the patch compatible with latest trunk (revision 1237253).


Summary
---

This optimizer exploits intra-query correlations and merges multiple correlated 
MapReduce jobs into one jobs.


This addresses bug HIVE-2206.
https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 
1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 
1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 
1237326 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326 

Diff: https://reviews.apache.org/r/2001/diff


Testing
---


Thanks,

Yin



 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-01-20 Thread Kevin Wilfong (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190147#comment-13190147
 ] 

Kevin Wilfong commented on HIVE-2206:
-

I tried running 

explain select * from (select * from src distribute by key sort by key) a join 
src b  on a.key = b.key;

using HIVE-2206.8.r1224646.patch.txt and I get the following exception:

FAILED: Hive Internal Error: 
java.lang.ClassCastException(org.apache.hadoop.hive.ql.exec.SelectOperator 
cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator)
java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.SelectOperator 
cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.findPeerReduceSinkOperators(CorrelationOptimizer.java:256)
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.process(CorrelationOptimizer.java:503)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:193)
at 
org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-01-20 Thread Kevin Wilfong (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190151#comment-13190151
 ] 

Kevin Wilfong commented on HIVE-2206:
-

Nevermind, sorry, it was the distribute by followed by sort by.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-01-20 Thread Kevin Wilfong (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190161#comment-13190161
 ] 

Kevin Wilfong commented on HIVE-2206:
-

The above bug is a pre-existing issue with reduce sink reduplication.

The following new exception is produced by the query:

set hive.optimize.reducededuplication=false;
explain select * from (select * from src distribute by key sort by key) a join 
src b on a.key = b.key;

FAILED: Hive Internal Error: java.lang.ArrayIndexOutOfBoundsException(1)
java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.createCorrelationCompositeReducesinkOperaotr(CorrelationOptimizerUtils.java:599)
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.applyCorrelation(CorrelationOptimizerUtils.java:365)
at 
org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:198)
at 
org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-01-20 Thread Yin Huai (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190276#comment-13190276
 ] 

Yin Huai commented on HIVE-2206:


@Kevin,
I will take a look at it. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2011-12-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177318#comment-13177318
 ] 

jirapos...@reviews.apache.org commented on HIVE-2206:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
---

(Updated 2011-12-29 18:50:12.277210)


Review request for hive.


Summary
---

This optimizer exploits intra-query correlations and merges multiple correlated 
MapReduce jobs into one jobs.


This addresses bug HIVE-2206.
https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 
1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 
1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 
1224666 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1224666 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1224666 

Diff: https://reviews.apache.org/r/2001/diff


Testing (updated)
---


Thanks,

Yin



 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, Queries, 
 YSmartPatchForHive.patch, testQueries.1.q, testQueries.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2011-09-21 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13109753#comment-13109753
 ] 

jirapos...@reviews.apache.org commented on HIVE-2206:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
---

Review request for hive.


Summary
---

This optimizer exploits intra-query correlations and merges multiple correlated 
MapReduce jobs into one jobs.


This addresses bug HIVE-2206.
https://issues.apache.org/jira/browse/HIVE-2206


Diffs
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1173271 
  
trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
 1173271 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationDispatchOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationFakeReduceSinkOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationManualForwardOperator.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReduceSinkOperator.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1173271 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationGenMRRedSink1.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
 PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
1173271 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java 
PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationDispatchDesc.java 
PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationFakeReduceSinkDesc.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationManualForwardDesc.java
 PRE-CREATION 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReduceSinkDesc.java 
PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1173271 
  
trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCountDistinct.java
 PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1173271 
  trunk/ql/src/test/results/clientpositive/show_functions.q.out 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1173271 

Diff: https://reviews.apache.org/r/2001/diff


Testing
---

Ran all unit tests


Thanks,

Yin



 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5.patch.txt, Queries, 
 YSmartPatchForHive.patch, testQueries.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2011-09-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13107866#comment-13107866
 ] 

Ashutosh Chauhan commented on HIVE-2206:


@Yin,

To overwrite current results you can do the following:
{code}
ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q -Doverwrite=true
{code}

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2011-09-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13107900#comment-13107900
 ] 

Yin Huai commented on HIVE-2206:


@Ashutosh, Thanks

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2011-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13107530#comment-13107530
 ] 

Yin Huai commented on HIVE-2206:


found a bug in the HIVE-2206.1.patch.txt. will upload a update version later. 

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.1.patch.txt, Queries, 
 YSmartPatchForHive.patch, testQueries.q


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




  1   2   >