[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711305#comment-13711305
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  I have addressed some comments and will address the rest later.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 These OIs 
are not needed. I have removed them.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 
JoinOperator relies on the tag to function correctly. I will add a comment to 
explain why we need to revert the newTag to the oldTag.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Done.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 Done.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Done
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Yes, I have 
changed it to numParents = getNumParent();
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Done. Since 
there is another check in initializeOp, I will throw the exception there.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 Done
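
To illustrate the tag handling mentioned for DemuxOperator.java:61, here is a minimal sketch, not the actual Hive code. It assumes that when several operators share one shuffle, every input gets a new, unique tag, and that the demultiplexing step must hand each row to the right child together with the tag that child originally expected. The class TagDemuxSketch and its methods register and route are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of tag demultiplexing; not the real DemuxOperator. */
final class TagDemuxSketch {
    // new tag (unique across the merged shuffle) -> tag the child originally expected
    private final Map<Integer, Integer> newTagToOldTag = new HashMap<>();
    // new tag -> index of the child operator that should receive the row
    private final Map<Integer, Integer> newTagToChildIndex = new HashMap<>();

    /** Records which child and which original tag a new tag stands for. */
    void register(int newTag, int childIndex, int oldTag) {
        newTagToChildIndex.put(newTag, childIndex);
        newTagToOldTag.put(newTag, oldTag);
    }

    /** Returns {childIndex, oldTag} for a row that arrived with newTag. */
    int[] route(int newTag) {
        Integer childIndex = newTagToChildIndex.get(newTag);
        Integer oldTag = newTagToOldTag.get(newTag);
        if (childIndex == null || oldTag == null) {
            throw new IllegalArgumentException("Unknown tag: " + newTag);
        }
        // A join-like child tells its inputs apart by tag, so it must see the old tag.
        return new int[] {childIndex, oldTag};
    }
}
```

For example, if two joins each expect inputs tagged 0 and 1, the four inputs could be registered with new tags 0 to 3, and route(2) would then yield child 1 with old tag 0.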

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence-by-sentence fashion, it generates 
 a MapReduce job for every operation that may need to shuffle the data (e.g. 
 join and aggregation operations). However, these operations may involve the 
 correlations explained below and can then be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of 
 its child nodes if it has the same partition key as that child node.
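
 The three definitions above can be sketched as simple predicates. This is an illustrative model only, not the optimizer's implementation: MrJob and Correlations are hypothetical classes, and a partition key is simplified here to a list of column names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of an MR job; not a Hive class. */
final class MrJob {
    final String name;
    final Set<String> inputTables;   // the job's input relation set
    final List<String> partitionKey; // columns used to shuffle the data (simplified)

    MrJob(String name, Set<String> inputTables, List<String> partitionKey) {
        this.name = name;
        this.inputTables = inputTables;
        this.partitionKey = partitionKey;
    }
}

final class Correlations {
    private Correlations() {}

    /** IC: the input relation sets are not disjoint. */
    static boolean hasInputCorrelation(MrJob a, MrJob b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** TC: input correlation plus the same partition key. */
    static boolean hasTransitCorrelation(MrJob a, MrJob b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: a job has the same partition key as one of its child nodes. */
    static boolean hasJobFlowCorrelation(MrJob job, MrJob child) {
        return job.partitionKey.equals(child.partitionKey);
    }
}
```

 Comparing keys by column name is a deliberate simplification. A real check would have to trace key expressions back to the columns they come from.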
 The current implementation of the correlation optimizer only detects 
 correlations among MR jobs for reduce-side join operators and reduce-side 
 aggregation operators (not map-only aggregation). A query will be optimized 
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side 
 aggregation operator which has JFC with all of its parent MR jobs (TCs will 
 also be

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711556#comment-13711556
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Another check point. Will finish soon and generate new patch

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250
 this function is not needed. I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590
 it is not used. I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630
 I have deleted it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647
 I have deleted it. We can extend the scope of this optimizer in a follow-up 
jira.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79
 Done
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83
 Done

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711617#comment-13711617
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171
 table should not be null here. I will throw an exception when 
table == null.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153
 Since CommonJoinTaskDispatcher is in the physical optimization phase, it 
seems that we cannot easily refactor this part of the code. I suggest 
refactoring it in a follow-up jira.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711766#comment-13711766
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has accepted the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  +1 Awesome work, Yin!
  Beautiful ascii art too : ) Finally some great comments in code. : )

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130716

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
 HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
 HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
 HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
 testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence-by-sentence fashion, it generates 
 a MapReduce job for every operation that may need to shuffle the data (e.g. 
 join and aggregation operations). However, these operations may involve the 
 correlations explained below and can then be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of 
 its child nodes if it has the same partition key as that child node.
 The current implementation of the correlation optimizer only detects 
 correlations among MR jobs for reduce-side join operators and reduce-side 
 aggregation operators (not map-only aggregation). A query will be optimized 
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side 
 aggregation operator which has JFC with all of its parent MR jobs (TCs will 
 also be exploited if JFC exists);
 # All input tables of those correlated MR jobs are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 The correlation optimizer is implemented as a logical optimizer. The main 
 reasons are that it only needs to manipulate the query plan tree and it can 
 leverage the existing component for generating MR jobs.
 The current implementation can serve as a framework for correlation-related 
 optimizations. I think that it is better than adding individual optimizers. 
 Several things can be done in the future to improve this optimizer. 
 Here are three examples.
 # Support queries that only involve TC;
 # Support queries in which the input tables of correlated MR jobs include 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-17 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712037#comment-13712037
 ] 

Hive QA commented on HIVE-2206:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12592900/HIVE-2206.patch

{color:green}SUCCESS:{color} +1 all tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/71/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/71/console

Messages:
Executing org.apache.hive.ptest.execution.CleanupPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase

This message is automatically generated.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710219#comment-13710219
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has requested changes to the revision HIVE-2206 [jira] add a new 
optimizer for query correlation discovery and optimization.

  Minor comments, mostly around improving documentation in code.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Does 
this patch make this necessary? Or did you add it just for completeness?
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Better to 
do it as List<Object> thisRow = (List<Object>) row;
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 It will be 
good to add comments for all these maps. What mappings are they tracking?
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 It will be 
good to add some ascii art showing an example of such a plan.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Is this 
necessary?
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 I understand 
this, but it will be confusing for someone reading this comment for the first 
time, because before this patch the RS operator was always on the map side. We 
need to reword this so it's easier to read.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Can you add a 
comment on when this boolean will be true and when it will be false?
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Let's throw an 
exception here: if (childOperatorsArray.length != 1) throw new 
HiveException("Expected number of children is 1. Found : " + 
childOperatorsArray.length);
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 This should not 
be required. You can always get all the values of an enum by using its 
values() method.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 It will 
be good to add javadoc for this explaining why we should leave it as it is.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45
 It will be good to add javadoc for this class.
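
On the Utilities.java:180 comment: in standard Java, the compiler-generated static values() method returns all constants of an enum in declaration order, while valueOf(String) looks up a single constant by name. A minimal sketch follows. FieldKind and EnumValuesSketch are hypothetical names chosen for illustration and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enum used only for illustration. */
enum FieldKind { KEY, VALUE }

final class EnumValuesSketch {
    private EnumValuesSketch() {}

    /** Collects the names of all constants, so no hand-maintained list is needed. */
    static List<String> allNames() {
        List<String> names = new ArrayList<>();
        for (FieldKind kind : FieldKind.values()) {
            names.add(kind.name());
        }
        return names;
    }

    /** Looks up one constant by name; throws IllegalArgumentException if absent. */
    static FieldKind parse(String name) {
        return FieldKind.valueOf(name);
    }
}
```

Deriving the list from values() keeps it in sync with the enum when constants are added or removed.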

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Shane Pratt (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710227#comment-13710227
 ] 

Shane Pratt commented on HIVE-2206:
---

Thank you for your message.

I will be traveling the next several days so there may be a delay in my 
response to your email.

If you need to reach me now, please call the number below.  Otherwise, I will 
respond to you as soon as I can.


Shane
512-590-3925




 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
 HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
 HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
 HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
 HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, 
 HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, 
 HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, 
 HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q, 
 YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Because Hive translates queries in a sentence-by-sentence fashion, it 
 generates a separate MapReduce job for every operation that may need to 
 shuffle the data (e.g. join and aggregation operations). However, such 
 operations may involve the correlations explained below, in which case they 
 can be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of 
 its child nodes if it has the same partition key as that child node.
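
 The three definitions above can be sketched as simple predicates. This is 
 only an illustration, not code from the patch: the Job class, its fields and 
 the method names are hypothetical, and a partition key is reduced to a plain 
 string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a hypothetical model of MR jobs, not a Hive class. */
final class CorrelationSketch {

    /** Hypothetical MR job: the tables it reads and the key it partitions on. */
    static final class Job {
        final Set<String> inputTables;
        final String partitionKey;

        Job(String partitionKey, String... inputTables) {
            this.partitionKey = partitionKey;
            this.inputTables = new HashSet<String>(Arrays.asList(inputTables));
        }
    }

    /** IC: the input relation sets of the two jobs are not disjoint. */
    static boolean hasInputCorrelation(Job a, Job b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** TC: input correlation plus the same partition key. */
    static boolean hasTransitCorrelation(Job a, Job b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: a job has the same partition key as one of its child nodes. */
    static boolean hasJobFlowCorrelation(Job job, Job child) {
        return job.partitionKey.equals(child.partitionKey);
    }

    public static void main(String[] args) {
        Job join = new Job("key", "src", "src1");
        Job agg = new Job("key", "src");
        System.out.println(hasInputCorrelation(join, agg));
        System.out.println(hasTransitCorrelation(join, agg));
        System.out.println(hasJobFlowCorrelation(join, agg));
    }
}
```
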
 The current implementation of the correlation optimizer only detects 
 correlations among MR jobs for reduce-side join operators and reduce-side 
 aggregation operators (not map-only aggregation). A query will be optimized 
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side 
 aggregation operator which has JFC with all of its parent MR jobs (TCs will 
 also be exploited if JFC exists);
 # All input tables of those correlated MR jobs are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 The correlation optimizer is implemented as a logical optimizer. The main 
 reasons are that it only needs to manipulate the query plan tree and it can 
 leverage the existing component for generating MR jobs.
 The current implementation can serve as a framework for correlation-related 
 optimizations. I think that it is better than adding individual optimizers. 
 Several things can be done in the future to improve this optimizer. 
 Here are three examples.
 # Support queries that only involve TC;
 # Support queries in which the input tables of correlated MR jobs include 
 intermediate tables; and 
 # Optimize queries involving self joins. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710608#comment-13710608
 ] 

Phabricator commented on HIVE-2206:
---

ashutoshc has commented on the revision HIVE-2206 [jira] add a new optimizer 
for query correlation discovery and optimization.

  A few more comments. See which of these apply. If they don't apply, feel free 
to ignore them.

INLINE COMMENTS
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250
 What does this function do?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171
 It would be good to add a comment stating when table == null.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153
 It seems like a lot of the logic here is shared with CommonJoinTaskDispatcher. 
It would be good to have that refactored so that it's reusable here.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284
 It seems like this method always returns true, so it is not required.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271
 Add a comment saying that tree walking is done and now you will apply 
transformations which you have detected.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590
 Do we really need hasBeenRemoved() check?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597
 getKeyCols().size() is not a good check. I would recommend testing explicitly 
for the operators which we support right now.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630
 Do we still need this?
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647
 We should do jobFlowCorrelation as another pass in transform().
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526
 It will be good to add some ascii art which shows what tree structure we are 
returning from this function.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368
 It would be good to add Javadoc for this method.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453
 Didn't understand this comment. Probably we can word it better.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 I don't think 
we need to revert to oldTag here. We can keep using newTag.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 It doesn't 
look like you are using these OIs. We can probably get rid of them.
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 It would be 
good to add comments explaining the intent of this for loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Why is it 
called NumOriginalParents? Can it be just numOfParents?
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 There is 
already a forward in Demux, so this should not be needed.
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 You don't need 
this constructor.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83
 It looks like this map is not used anymore; let's get rid of it.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79
 It will be good to add comments about what this method is intending to do
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67
 This method straight away calls another method. We can eliminate this one.

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-07-16 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710687#comment-13710687
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  Added an explanation of startGroup. I will start to address the rest of the 
comments tomorrow.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Since 
we can have an operator tree with multiple JoinOperators and GroupByOperators 
inside, we need to propagate startGroup to all operators in the operator 
tree. For queries which are not optimized by this patch, we can have at most 1 
JoinOperator (at the beginning of the reduce side) and 2 GroupByOperators (1 at 
the beginning of the reduce side and 1 in hash mode just before the 
FileSinkOperator). This change will not affect those operators.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Please 
see my reply to the same change made in CommonJoinOperator.
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 It seems an enum 
does not have a method that returns its values as a list of strings.
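
  On the last point: Java enums expose values() and name(), but no built-in 
method that returns the constant names as a list of strings, so a small helper 
is one way to get it. The sketch below is only an illustration. The Field enum 
and the namesOf helper are hypothetical names, not the code at 
Utilities.java:180.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a generic helper that lists the names of an enum's constants. */
final class EnumNames {

    /** Hypothetical enum standing in for the one discussed in the review. */
    enum Field { KEY, VALUE, ALIAS }

    /** Returns the constant names of the given enum type, in declaration order. */
    static <E extends Enum<E>> List<String> namesOf(Class<E> enumType) {
        List<String> names = new ArrayList<String>();
        for (E constant : enumType.getEnumConstants()) {
            names.add(constant.name());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesOf(Field.class)); // prints [KEY, VALUE, ALIAS]
    }
}
```
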

REVISION DETAIL
  https://reviews.facebook.net/D11097

BRANCH
  HIVE-2206-3671-20130711

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, yhuai
Cc: brock


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-07 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678116#comment-13678116
 ] 

Phabricator commented on HIVE-2206:
---

brock has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 Looks like it's 
there because ArrayList defines a clone() method.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 I agree 
that Hive does this often. I don't mean to suggest you should fix this in all 
of Hive in this patch, but let's not add any additional printStackTraces. I see 
one additional new printStackTrace in your patch. Would you mind removing that 
one as well?

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13677381#comment-13677381
]

Yin Huai commented on HIVE-2206:

Updated the diff at https://reviews.facebook.net/D11097. Fixed two bugs. All
unit tests pass when the optimizer is turned off by default. I am evaluating
whether there is any issue when the optimizer is turned on by default.

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1360#comment-1360
 ] 

Phabricator commented on HIVE-2206:
---

brock has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

  I was just casually reading this patch and noted a few items.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 If we 
are throwing the exception, do we need to print it? Also, this should 
be logged, not printed.
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 Shouldn't we be 
returning a List or Collection rather than an ArrayList? There are a few other 
occurrences of this.
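
  The first comment can be illustrated with a short sketch: either rethrow 
without printing, or log at the place where the exception is actually handled. 
This is an illustration under stated assumptions, not the GroupByOperator code. 
The class and method names are hypothetical, and java.util.logging is used only 
to keep the sketch self-contained; it is not meant to represent Hive's own 
logging setup.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative only: logging versus printStackTrace when closing an operator. */
final class CloseOpSketch {
    private static final Logger LOG = Logger.getLogger(CloseOpSketch.class.getName());

    /** Hypothetical stand-in for the work done when an operator closes. */
    static void flush(boolean fail) throws Exception {
        if (fail) {
            throw new IllegalStateException("flush failed");
        }
    }

    /** Rethrows with the cause attached; no printing, the caller reports it. */
    static void closeRethrowOnly(boolean fail) {
        try {
            flush(fail);
        } catch (Exception e) {
            throw new RuntimeException("close failed", e);
        }
    }

    /** Handles the exception here, so it is logged (with stack trace) once. */
    static boolean closeAndLog(boolean fail) {
        try {
            flush(fail);
            return true;
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "close failed", e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(closeAndLog(false));
        System.out.println(closeAndLog(true));
    }
}
```
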

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-06 Thread Phabricator (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13677791#comment-13677791
 ] 

Phabricator commented on HIVE-2206:
---

yhuai has commented on the revision HIVE-2206 [jira] add a new optimizer for 
query correlation discovery and optimization.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 I did not 
notice it before. I copied the code of closeOp. I do not think we need to print 
the exception. I will change this class. Also, it seems that printing the 
exception appears in lots of other places. If we want to remove all of them, 
we need a separate JIRA.
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 MuxOperator is 
used to replace ReduceSinkOperators in an MR job optimized by this optimizer. I 
basically followed the code of ReduceSinkDesc. It seems the reason that 
ArrayList is used is for clone. I will leave ArrayList here for now.
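
  For reference, the trade-off discussed for MuxDesc can be sketched as 
follows. A descriptor can expose the List interface and still be copied if the 
copy is made through a copy constructor rather than ArrayList.clone(). This is 
only a sketch: DescSketch and its members are hypothetical and do not mirror 
MuxDesc or ReduceSinkDesc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a descriptor that exposes List but can still be copied. */
final class DescSketch {
    private List<String> keys = new ArrayList<String>();

    /** Callers see the interface, not the concrete ArrayList. */
    List<String> getKeys() {
        return keys;
    }

    void setKeys(List<String> keys) {
        this.keys = new ArrayList<String>(keys);
    }

    /** Copies the list with a copy constructor, which works for any List. */
    @Override
    public DescSketch clone() {
        DescSketch copy = new DescSketch();
        copy.keys = new ArrayList<String>(keys);
        return copy;
    }

    public static void main(String[] args) {
        DescSketch original = new DescSketch();
        original.setKeys(Arrays.asList("a", "b"));
        DescSketch copy = original.clone();
        copy.getKeys().add("c");
        System.out.println(original.getKeys()); // prints [a, b]
        System.out.println(copy.getKeys());     // prints [a, b, c]
    }
}
```
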

REVISION DETAIL
  https://reviews.facebook.net/D11097

To: JIRA, yhuai
Cc: brock


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676441#comment-13676441
]

Ashutosh Chauhan commented on HIVE-2206:

In your testcases, for some of the patterns you have (e.g., a Join followed by
a GBY on the same keys), I assume the reducesink deduplication optimization will
already take care of it such that it will generate only 1 MR job. Is that
correct? Or is it that the reducesink dedup optimization will not fire for any
of your testcases? If it's the former, then it will be good to identify which of
those cases are already taken care of by RS dedup. If it's the latter, then it
will be good to know why the reducesink dedup optimization is not kicking in
for those.

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676462#comment-13676462
]

Yin Huai commented on HIVE-2206:

RS dedup is on by default, so the explain output without CorrelationOptimizer
should already be optimized by RS dedup. However, it seems that it does not
fire in any of my cases. I will take a look at it later.

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-05 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676643#comment-13676643
 ] 

Yin Huai commented on HIVE-2206:


I just found that I need to set both hive.auto.convert.join and 
hive.auto.convert.join.noconditionaltask to false to let RS dedup work on 
cases with a join. I just tried two cases. It works on 
{code:sql}
SELECT x.key AS key, count(1) AS cnt FROM src1 x JOIN src y ON (x.key = y.key) 
GROUP BY x.key
{code}, and it does work on
{code:sql}
SELECT xx.key, xx.cnt, yy.key, yy.cnt
FROM
(SELECT x.a as key, count(*) AS cnt FROM src x group by x.a) xx
JOIN
(SELECT y.a as key, count(*) AS cnt FROM src1 y group by y.a) yy
ON (xx.key=yy.key);
{code}

I suggest that we let CorrelationOptimizer handle cases involving joins 
because it supports more cases and already includes the needed mechanisms.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 HIVE-2206.D11097.1.patch, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence-by-sentence fashion, it generates 
 a MapReduce job for every operation that may need to shuffle the data (e.g. 
 join and aggregation operations). However, such operations may involve the 
 correlations explained below and can thus be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of 
 its child nodes if it has the same partition key as that child node.
 The current implementation of the correlation optimizer only detects 
 correlations among MR jobs for reduce-side join operators and reduce-side 
 aggregation operators (not map-only aggregation). A query will be optimized 
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side 
 aggregation operator which has JFC with all of its parent MR jobs (TCs will 
 also be exploited if JFC exists);
 # All input tables of those correlated MR jobs are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
 The correlation optimizer is implemented as a logical optimizer. The main 
 reasons are that it only needs to manipulate the query plan tree and that it 
 can leverage the existing component for generating MR jobs.
 The current implementation can serve as a framework for correlation-related 
 optimizations. I think that this is better than adding individual optimizers. 
 There are several pieces of work that can be done in the future to improve 
 this optimizer. Here are three examples.
 # Support queries that only involve TC;
 # Support queries in which the input tables of correlated MR jobs involve 
 intermediate tables; and 
 # Optimize queries involving self join. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc
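
 As an illustration only (not part of the original proposal), a query of the
 following shape should satisfy the three conditions listed above: the join
 and the group-by use the same key, so the two MR jobs share a partition key
 (JFC); both inputs are base tables; and there is no self join. The tables t1
 and t2 are hypothetical, and hive.optimize.correlation is the property name I
 believe later Hive releases use for this optimizer, so treat it as an
 assumption here.

{code:sql}
-- Hypothetical tables for illustration.
create table t1 (a int, b int);
create table t2 (c int, d int);

-- Assumed property name for enabling the correlation optimizer.
set hive.optimize.correlation=true;

-- Reduce-side join on t1.a = t2.c followed by a group-by on the same key.
EXPLAIN
SELECT e.a, count(*) AS cnt
FROM (SELECT t1.a AS a FROM t1 JOIN t2 ON (t1.a = t2.c)) e
GROUP BY e.a;
{code}

 Comparing the EXPLAIN output with the property set to true and to false is
 one way to check whether the two MR jobs were merged into one.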

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-06-04 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13675569#comment-13675569
]

Yin Huai commented on HIVE-2206:

HIVE-2206.D11097.1.patch is the latest patch for the trunk. I have heavily
refactored my code. Here are the major changes.
# If multiple operation paths share the same input table, I just use a single
TableScanOperator and add the bottom operators of these paths as children of
this common TableScanOperator. I do not do any deduplication of common columns,
because deduplication would make the code significantly more complicated and
may introduce more problems. If we want to do deduplication, I suggest tackling
it later in follow-up work.
# Without deduplicating columns, the dispatcher at the reduce side has less
work to do and some queries involving self join can be optimized in the current
version.
# The fake ReduceSinkOperator (CorrelationLocalSimulativeReduceSinkOperator; I
will change the name later) no longer does the serialization and
deserialization that the previous version did.
# New test cases are added.
# I also refactored ReduceSinkDeDuplication, since CorrelationOptimizer can
reuse some methods introduced by ReduceSinkDeDuplication. [~navis], can you
take a look at it and see if my changes make sense?
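
To make the first point concrete, here is a sketch of a query with two
operation paths over the same input table. It assumes the standard src test
table with a key column. The intent described above is that both sub-queries
read from one common TableScanOperator instead of scanning src twice; this is
a description of the intent, not verified plan output.

{code:sql}
-- Both sub-queries read the same input table (src), so their bottom
-- operators can hang off a single shared table scan.
EXPLAIN
SELECT xx.key, xx.cnt, yy.key, yy.cnt
FROM
(SELECT x.key AS key, count(*) AS cnt FROM src x GROUP BY x.key) xx
JOIN
(SELECT y.key AS key, count(*) AS cnt FROM src y GROUP BY y.key) yy
ON (xx.key = yy.key);
{code}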

I will run all unit tests soon and will also add more comments.

By the way, there is an issue in correlationoptimizer2.q. For outer joins,
optimized plans cannot generate rows in which both join keys (from the left
table and the right table) are null. I am looking at it.

add a new optimizer for query correlation discovery and optimization

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-02-11 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575844#comment-13575844
 ] 

Yin Huai commented on HIVE-2206:


[~ashutoshc] have you had time to look at the patch?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
 HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
 HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
 HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
 testQueries.2.q, YSmartPatchForHive.patch



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-18 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557411#comment-13557411
]

Ashutosh Chauhan commented on HIVE-2206:

I am having second thoughts on cloning. Cloning graphs (like the query plan) or
dense structures (like ParseContext) is fraught with peril. It's likely that
cloning will require new code and will arguably have hard-to-detect bugs, since
we need to track down every single pointer and clone all the way through. To
avoid such issues, and for simplicity, I think we can drop the cloning idea. The
feature is behind a config option which is off by default anyway, so the query
plan will be modified only for users who turn the flag on.
Yin, if you have addressed my other comments, can you update the patch on RB
and upload it here on JIRA? I will take another look at it.

add a new optimizer for query correlation discovery and optimization



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-18 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557416#comment-13557416
]

Ashutosh Chauhan commented on HIVE-2206:

Oh.. I see that you have already updated RB and jira. I will take a look at it
soon.

add a new optimizer for query correlation discovery and optimization


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-16 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555155#comment-13555155
]

Yin Huai commented on HIVE-2206:

So, if a map join is involved in a plan and the output of this join will be
consumed by a subsequent operator, we should leave the map join in the map
phase of the subsequent operator instead of splitting the map join into a
separate MR job. Is my understanding correct? If so, I think that we should
change the process of generating map-join operators so that, after generating a
map-join operator, we do not insert a FileSinkOperator after it. It seems that
Rule 11 in org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genMapRedTasks is
the rule for generating a separate MR job for a map join.

add a new optimizer for query correlation discovery and optimization

Key: HIVE-2206
URL: https://issues.apache.org/jira/browse/HIVE-2206
Project: Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
Attachments: HIVE-2206.10-r1384442.patch.txt,
HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt,
HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt,
HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt,
HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt,
HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt,
HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt,
HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt,
HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt,
HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-16 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555491#comment-13555491
 ] 

Ashutosh Chauhan commented on HIVE-2206:


I ran my tests with HIVE-3784, and my query (i.e., a mapjoin followed by a 
group-by on different keys) now gets a single MR job. That's cool.

 add a new optimizer for query correlation discovery and optimization
 




[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-15 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13554712#comment-13554712
]

Ashutosh Chauhan commented on HIVE-2206:

I did some testing of this on our use-cases. Let's say you have two tables and
the following two queries:

{code}
create table t1 (a int, b int);
create table t2 (c int, d int);
select a from ( select * from t1 join t2 on (t1.a = t2.c) ) e group by a;
select a from ( select /*+ MAPJOIN(t2) */ * from t1 join t2 on (t1.a = t2.c) ) e
group by a;
{code}

Now, ysmart is able to optimize the first query fine: it fuses 2 MR jobs into 1
MR job, since the join and the group-by have the same key.
However, this doesn't work with the 2nd query, which has a mapjoin; it results
in 2 MR jobs. This is especially important because in the map-join case you
don't need the join key to be the same as the group-by key. In our use-cases,
we have observed that it's rarely the case that the join and the group-by are
on the same key. But in most cases we are able to use a map-join, since the
data we are joining on is small enough. The subsequent group-by, which is on a
different key, can then happen on the reduce side of this single MR job.
Any thoughts on how this could be achieved?
It looks like we don't need ysmart for this use-case, though, since we don't
need to detect any correlation. We can walk the query plan and always fuse a
map-side join followed by a group-by (i.e., no need to detect any correlation
among inputs or to detect whether the join and group-by keys are the same).
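
A sketch of the more common shape described above, where the join key and the
group-by key differ. It reuses the t1 and t2 tables from the example above
(recreated with IF NOT EXISTS only so the snippet stands alone). The hope is
that the join runs in the map phase and the group-by in the reduce phase of
one MR job; whether that actually happens depends on the Hive revision and how
the plan is generated, so this is an assumption to check with EXPLAIN, not a
guaranteed result.

{code:sql}
-- Same tables as above; IF NOT EXISTS keeps the sketch self-contained.
create table if not exists t1 (a int, b int);
create table if not exists t2 (c int, d int);

-- Map-side join on a = c, then a group-by on a different key (b).
EXPLAIN
SELECT e.b, count(*) AS cnt
FROM (SELECT /*+ MAPJOIN(t2) */ t1.b AS b FROM t1 JOIN t2 ON (t1.a = t2.c)) e
GROUP BY e.b;
{code}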

add a new optimizer for query correlation discovery and optimization



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-13 Thread Liu Zongquan (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552281#comment-13552281
]

Liu Zongquan commented on HIVE-2206:

[~yhuai] I have a question: why not release a patch against a stable Hive
release, e.g., branch hive-0.8-r2? I found that r1410581 is not a stable
revision; I can't even get through ant test -Dtestcase=TestCliDriver
-Dqfile=show_functions.q -Doverwrite=true on this revision. If this patch were
based on a stable version, especially a stable branch, your work would benefit
more people. This is just a suggestion.

add a new optimizer for query correlation discovery and optimization


This issue proposes a new logical optimizer called Correlation Optimizer,
which is used to merge correlated MapReduce jobs (MR jobs) into a single MR
job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The
paper and slides of YSmart are linked at the bottom.
Since Hive translates queries in a sentence-by-sentence fashion, it generates
a separate MapReduce job for every operation which may need to shuffle the
data (e.g. join and aggregation operations). However, these operations may
involve the correlations explained below and thus can be executed in a
single MR job.
# Input Correlation: Multiple MR jobs have input correlation (IC) if their
input relation sets are not disjoint;
# Transit Correlation: Multiple MR jobs have transit correlation (TC) if they
have not only input correlation, but also the same partition key;
# Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its
child nodes if it has the same partition key as that child node.
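The three definitions above can be read as simple predicates over job descriptors. The following is only a minimal sketch in Python, not code from the patch: the MRJob class, its fields, and the function names are hypothetical, and a job's partition key is simplified to a tuple of column names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical descriptor of one MapReduce job in a query plan."""
    name: str
    input_tables: FrozenSet[str]          # relations read by this job
    partition_key: Tuple[str, ...]        # columns used to shuffle rows to reducers
    children: Tuple["MRJob", ...] = ()    # child nodes of this job in the job tree


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    """IC: the input relation sets of the two jobs are not disjoint."""
    return not a.input_tables.isdisjoint(b.input_tables)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    """TC: input correlation plus the same partition key."""
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    """JFC: the job has the same partition key as one of its child nodes."""
    return child in job.children and job.partition_key == child.partition_key


# Illustrative plan: two aggregations over overlapping inputs feed one join.
agg1 = MRJob("agg1", frozenset({"t1"}), ("key",))
agg2 = MRJob("agg2", frozenset({"t1", "t2"}), ("key",))
join = MRJob("join", frozenset(), ("key",), children=(agg1, agg2))
```

In this toy plan agg1 and agg2 share table t1 and the partition key, so they have both IC and TC, and join has JFC with each of them.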
The current implementation of the correlation optimizer only detects correlations
among MR jobs for reduce-side join operators and reduce-side aggregation
operators (not map-only aggregation). A query will be optimized if it
satisfies the following conditions.
# There exists an MR job for a reduce-side join operator or reduce-side
aggregation operator which has JFC with all of its parent MR jobs (TCs will
also be exploited if JFC exists);
# All input tables of those correlated MR jobs are original input tables (not
intermediate tables generated by sub-queries); and
# No self join is involved in those correlated MR jobs.
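One possible reading of the three conditions above, written as a single check. This is a hedged sketch under stated assumptions rather than the optimizer's actual logic: the Job class, its fields, and is_candidate are hypothetical names, a self join is modeled as a join job that lists the same table twice, and "reads an intermediate table" is reduced to a boolean flag.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Job:
    """Hypothetical MR job descriptor used only for this illustration."""
    name: str
    kind: str                             # "join", "groupby", or anything else
    reduce_side: bool                     # False for map-only operators
    partition_key: Tuple[str, ...]
    input_tables: Tuple[str, ...] = ()    # a self join lists one table twice
    reads_intermediate: bool = False      # True if an input comes from a sub-query
    parents: Tuple["Job", ...] = ()       # parent MR jobs feeding this job


def is_candidate(job: Job) -> bool:
    """Return True if `job` and its parents satisfy the three conditions."""
    # Only reduce-side join and reduce-side aggregation operators are considered.
    if not (job.reduce_side and job.kind in ("join", "groupby")):
        return False
    if not job.parents:
        return False
    # Condition 1: JFC with all parent jobs (same partition key).
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    correlated = (job,) + job.parents
    # Condition 2: every input table is an original table.
    if any(j.reads_intermediate for j in correlated):
        return False
    # Condition 3: no self join among the correlated jobs.
    for j in correlated:
        if j.kind == "join" and len(set(j.input_tables)) < len(j.input_tables):
            return False
    return True
```

Returning False as soon as one condition fails keeps the three rules easy to read one by one; a real optimizer would work on the operator tree rather than on flat descriptors like these.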
The correlation optimizer is implemented as a logical optimizer. The main reasons
are that it only needs to manipulate the query plan tree and it can leverage
the existing component for generating MR jobs.
The current implementation can serve as a framework for correlation-related
optimizations. I think that it is better than adding individual optimizers.
Several things can be done in the future to improve this optimizer.
Here are three examples.
# Support queries that only involve TC;
# Support queries in which the input tables of correlated MR jobs involve
intermediate tables; and
# Optimize queries involving self join.
References:
Paper and presentation of YSmart.
Paper:
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-13 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552290#comment-13552290
 ] 

Ashutosh Chauhan commented on HIVE-2206:


If Yin wants to provide a patch against a stable (or any) branch, that's his 
choice. But for the patch to be committed, it needs to be committed to trunk 
first.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547982#comment-13547982
 ] 

Hudson commented on HIVE-2206:
--

Integrated in Hive-trunk-hadoop2 #54 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/54/])
HIVE-2206:add a new optimizer for query correlation discovery and 
optimization (Yin Huai via He Yongqiang) (Revision 1392105)

 Result = ABORTED
heyongqiang : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1392105
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Liu Zongquan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548809#comment-13548809
 ] 

Liu Zongquan commented on HIVE-2206:


[~yhuai] Thanks so much!


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549298#comment-13549298
 ] 

Ashutosh Chauhan commented on HIVE-2206:


This patch looks useful, especially since once we have it in, it will open up 
other optimization possibilities. I have left some comments on 
https://reviews.apache.org/r/7126/ 


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-09 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549301#comment-13549301
 ] 

Ashutosh Chauhan commented on HIVE-2206:


Also, can this work enable or facilitate implementing the optimization that 
is being discussed in HIVE-3773?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-07 Thread Liu Zongquan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545984#comment-13545984
 ] 

Liu Zongquan commented on HIVE-2206:


If I plan to merge HIVE-2206 into the hive source code, which branch should I 
use? Can someone tell me?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2013-01-07 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546150#comment-13546150
 ] 

Yin Huai commented on HIVE-2206:


[~liuzongquan] The latest patch was developed based on hive trunk revision 
1410581.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called Correlation Optimizer, 
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence-by-sentence fashion, it generates 
 a MapReduce job for every operation which may need to shuffle the data (e.g. 
 join and aggregation operations). However, such operations may involve the 
 correlations explained below and can then be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of 
 its child nodes if it has the same partition key as that child node.
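
The three definitions above can be read as simple predicates over a job's input tables and partition key. The following is only a minimal sketch under that reading: `MrJob`, `CorrelationSketch`, and all field and method names are hypothetical and are not classes from Hive or from the patch.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical model of one MR job; this is not a Hive class. */
final class MrJob {
    final Set<String> inputTables;    // names of the input relations
    final List<String> partitionKey;  // columns the job shuffles on
    final List<MrJob> childNodes;     // child nodes, in the sense used by the description

    MrJob(Set<String> inputTables, List<String> partitionKey, List<MrJob> childNodes) {
        this.inputTables = inputTables;
        this.partitionKey = partitionKey;
        this.childNodes = childNodes;
    }
}

final class CorrelationSketch {
    /** IC: the input relation sets are not disjoint. */
    static boolean hasInputCorrelation(MrJob a, MrJob b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** TC: input correlation plus the same partition key. */
    static boolean hasTransitCorrelation(MrJob a, MrJob b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: the job has the same partition key as the given child node. */
    static boolean hasJobFlowCorrelation(MrJob job, MrJob child) {
        return job.childNodes.contains(child) && job.partitionKey.equals(child.partitionKey);
    }

    public static void main(String[] args) {
        // Two jobs that share table t2 and shuffle on the same key, plus a job on top of them.
        MrJob left = new MrJob(Set.of("t1", "t2"), List.of("k"), List.of());
        MrJob right = new MrJob(Set.of("t2", "t3"), List.of("k"), List.of());
        MrJob top = new MrJob(Set.of(), List.of("k"), List.of(left, right));

        System.out.println("IC:  " + hasInputCorrelation(left, right));
        System.out.println("TC:  " + hasTransitCorrelation(left, right));
        System.out.println("JFC: " + hasJobFlowCorrelation(top, left));
    }
}
```

Comparing partition keys with `List.equals` assumes that key order matters; a real detector would have to decide how to treat reordered or renamed key columns.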
 The current implementation of the correlation optimizer only detects 
 correlations among MR jobs for reduce-side join operators and reduce-side 
 aggregation operators (not map-only aggregation). A query will be optimized 
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side 
 aggregation operator which has JFC with all of its parent MR jobs (TCs will 
 also be exploited if JFC exists);
 # All input tables of those correlated MR jobs are original input tables (not 
 intermediate tables generated by sub-queries); and 
 # No self join is involved in those correlated MR jobs.
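
As an illustration of how the three conditions combine, here is a minimal sketch of an eligibility check. It is not the logic of the patch: `CandidateJob`, `EligibilitySketch`, and the `readsIntermediateTable` flag are hypothetical, and a self join is modeled simply as the same table name appearing twice in one job's inputs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical description of one reduce-side join or aggregation job. */
final class CandidateJob {
    final List<String> inputTables;        // one entry per table reference
    final List<String> partitionKey;       // columns the job shuffles on
    final boolean readsIntermediateTable;  // assumed flag: a base input comes from a sub-query
    final List<CandidateJob> parents;      // parent MR jobs

    CandidateJob(List<String> inputTables, List<String> partitionKey,
                 boolean readsIntermediateTable, List<CandidateJob> parents) {
        this.inputTables = inputTables;
        this.partitionKey = partitionKey;
        this.readsIntermediateTable = readsIntermediateTable;
        this.parents = parents;
    }
}

final class EligibilitySketch {
    /** Condition 1: the job shares its partition key with every parent job. */
    static boolean hasJfcWithAllParents(CandidateJob job) {
        if (job.parents.isEmpty()) {
            return false;
        }
        for (CandidateJob parent : job.parents) {
            if (!parent.partitionKey.equals(job.partitionKey)) {
                return false;
            }
        }
        return true;
    }

    /** Condition 2: neither the job nor its parents read intermediate tables. */
    static boolean usesOnlyOriginalTables(CandidateJob job) {
        if (job.readsIntermediateTable) {
            return false;
        }
        for (CandidateJob parent : job.parents) {
            if (parent.readsIntermediateTable) {
                return false;
            }
        }
        return true;
    }

    /** Condition 3: no job in the group references the same table twice. */
    static boolean hasNoSelfJoin(CandidateJob job) {
        if (repeatsATable(job)) {
            return false;
        }
        for (CandidateJob parent : job.parents) {
            if (repeatsATable(parent)) {
                return false;
            }
        }
        return true;
    }

    private static boolean repeatsATable(CandidateJob job) {
        Set<String> seen = new HashSet<>();
        for (String table : job.inputTables) {
            if (!seen.add(table)) {
                return true;  // add() returns false when the name was already present
            }
        }
        return false;
    }

    /** A job group qualifies only if all three conditions hold. */
    static boolean isCandidate(CandidateJob job) {
        return hasJfcWithAllParents(job) && usesOnlyOriginalTables(job) && hasNoSelfJoin(job);
    }
}
```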
 The correlation optimizer is implemented as a logical optimizer. The main 
 reasons are that it only needs to manipulate the query plan tree and it can 
 leverage the existing component for generating MR jobs.
 The current implementation can serve as a framework for correlation-related 
 optimizations. I think that it is better than adding individual optimizers. 
 Several pieces of work can be done in the future to improve this optimizer. 
 Here are three examples.
 # Support queries that only involve TC;
 # Support queries in which the input tables of correlated MR jobs include 
 intermediate tables; and 
 # Optimize queries involving self joins. 
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-28 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13505495#comment-13505495
]

Yin Huai commented on HIVE-2206:

[~cwsteinbach] I am not sure whether the unit tests in Hive are comprehensive
enough. If not, it might be better to turn this optimizer on by default in the
future, after we have tested it with more queries.

I just ran all unit tests with the correlation optimizer enabled. When map-side
aggregation is on, the correlation optimizer also requires the regular
reduce-side aggregation to be generated, so if cube or rollup is used in the
query, error message 10209
(org.apache.hadoop.hive.ql.ErrorMsg.HIVE_GROUPING_SETS_AGGR_NOMAPAGGR) is
thrown. It seems HIVE-3508 can solve this issue. Apart from this issue, a few
query plans need to be regenerated because operator ids have changed.

This jira has taken a long time. Can we wrap it up so that I can start to work
on follow-up jiras?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Carl Steinbach (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500469#comment-13500469
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yin: The correlation optimizer is only enabled for a small set of new 
CliDriver tests. If I enable the correlation optimizer by default, which of the 
existing CliDriver tests are expected to fail?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread David Inbar (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500474#comment-13500474
]

David Inbar commented on HIVE-2206:
---

I will be on vacation through Friday Nov 23rd, but will be checking email and
voicemail periodically.

For all time-critical items, please call my mobile phone.

Many thanks,
David


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500499#comment-13500499
]

Yin Huai commented on HIVE-2206:

[~cwsteinbach]
If the optimizer is enabled by default, based on my last tests, only
auto_join26.q is expected to fail, because it will be optimized by the
correlation optimizer. Only the query plan differs; the query result of
auto_join26.q is still correct. Also, once I finish HIVE-3671 (I am working on
it right now), the failure of auto_join26.q should be eliminated.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-19 Thread Carl Steinbach (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500626#comment-13500626
 ] 

Carl Steinbach commented on HIVE-2206:
--

I'm surprised that auto_join26 is the only test that fails due to different 
EXPLAIN output. Is that because this optimization doesn't affect the queries in 
most tests, or because we don't consistently call EXPLAIN in the tests?

What is preventing us from enabling this by default right now?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-18 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13499858#comment-13499858
]

Yin Huai commented on HIVE-2206:

[~namit]
Sure. I just took a look at the code. It seems that once I have the content
summaries of all input tables, I can guess whether the join auto resolver will
work for join operators on those input tables. As far as I know, the existing
util functions for retrieving content summaries (called after logical
optimization) cannot be used directly here, so I need to write some util
functions to get the sizes of input tables. I will start to work on this asap.
Also, although HIVE-3671 does not seem hard to do, it is not a quick fix. I
suggest we track this work in a separate jira.
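
The size-based guess mentioned above could look roughly like the sketch below. It assumes, purely for illustration, that a join is likely to be converted when all inputs except the largest one fit under a small-table threshold; `MapJoinGuessSketch`, `mayBecomeMapJoin`, and the threshold parameter are hypothetical names, not existing Hive utilities.

```java
import java.util.Map;

final class MapJoinGuessSketch {
    /**
     * Guess whether a join over the given input tables might be converted
     * to a map join: all tables except the largest must fit under the
     * threshold. Sizes are bytes, keyed by a hypothetical table name.
     */
    static boolean mayBecomeMapJoin(Map<String, Long> tableSizesInBytes,
                                    long smallTableThresholdBytes) {
        if (tableSizesInBytes.size() < 2) {
            return false;  // a join needs at least two inputs
        }
        long total = 0L;
        long largest = 0L;
        for (long size : tableSizesInBytes.values()) {
            total += size;
            largest = Math.max(largest, size);
        }
        // Everything except the largest table would have to be held in memory.
        return total - largest <= smallTableThresholdBytes;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = Map.of("big", 1_000_000_000L, "small", 10_000L);
        System.out.println(mayBecomeMapJoin(sizes, 25_000_000L));
    }
}
```

In practice the sizes would come from the content summaries of the input tables, which this sketch leaves out.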

[~cwsteinbach]
Have you had time to look at the current patch? Any comments?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-14 Thread Namit Jain (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497756#comment-13497756
]

Namit Jain commented on HIVE-2206:
--

It would be a good idea to get HIVE-3671 into this patch.
With HIVE-3671, the functionality will be much more useful to the whole
community.
[~yhuai], can you investigate getting HIVE-3671 in as part of this patch, and
see how much work it is? Based on that, we can proceed.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496399#comment-13496399
 ] 

He Yongqiang commented on HIVE-2206:


+1, I will commit after tests pass.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496521#comment-13496521
]

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Can you please hold off on committing while I take another look?
Thanks.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496528#comment-13496528
 ] 

He Yongqiang commented on HIVE-2206:


@Carl, you can go ahead and comment; Huai will address the comments in a separate diff.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496529#comment-13496529
]

He Yongqiang commented on HIVE-2206:

@Carl, keep in mind that you have already had months to comment, so
addressing your comments in new JIRAs may make more sense.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496658#comment-13496658
]

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Please hold off on committing this for a day. Thanks.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496697#comment-13496697
 ] 

He Yongqiang commented on HIVE-2206:


Okay, I will aim to commit it this weekend or early next week.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-13 Thread Carl Steinbach (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496701#comment-13496701
 ] 

Carl Steinbach commented on HIVE-2206:
--

Thanks!


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-05 Thread Yin Huai (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13490681#comment-13490681
]

Yin Huai commented on HIVE-2206:

[~namit]
Sure. I created an umbrella JIRA (HIVE-3667) for all work related to the
correlation optimizer and also created several follow-up JIRAs as sub-tasks.
You can also add other sub-tasks to that JIRA.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-11-04 Thread Namit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13490432#comment-13490432
 ] 

Namit Jain commented on HIVE-2206:
--

[~yhuai], can you file follow-up JIRAs for the cases that don't work with this
optimization?
It would be good to link them to this JIRA. Adding them to the wiki
would also be useful for tracking.

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, 
 HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, 
 HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, 
 HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 This issue proposes a new logical optimizer called the Correlation Optimizer,
 which is used to merge correlated MapReduce jobs (MR jobs) into a single MR
 job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The
 paper and slides of YSmart are linked at the bottom.
 Since Hive translates queries in a sentence-by-sentence fashion, it generates
 a MapReduce job for every operation which may need to shuffle the data (e.g.
 join and aggregation operations). However, those operations may involve the
 correlations explained below and can thus be executed in a single MR job.
 # Input Correlation: Multiple MR jobs have input correlation (IC) if their
 input relation sets are not disjoint;
 # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they
 have not only input correlation, but also the same partition key;
 # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of
 its child nodes if it has the same partition key as that child node.
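 The three definitions above can be sketched as simple predicates. This is only
 an illustration and not the optimizer's code: the MRJob class, its fields, and
 the function names are hypothetical, and a job is reduced to its input
 relation set, its partition key, and the names of its child nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRJob:
    """Hypothetical stand-in for one MapReduce job in a query plan."""
    name: str
    inputs: frozenset        # names of the input relations read by the job
    partition_key: tuple     # columns the job partitions (shuffles) on
    children: tuple = ()     # names of the job's child nodes in the job tree


def has_input_correlation(a, b):
    """IC: the input relation sets of the two jobs are not disjoint."""
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a, b):
    """TC: input correlation plus the same partition key."""
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job, child):
    """JFC: `child` is a child node of `job` and both share a partition key."""
    return child.name in job.children and job.partition_key == child.partition_key
```

 Under these assumptions, two aggregation jobs that both read the same table
 and partition on the same column have TC, and a join job that lists them as
 child nodes and partitions on that column has JFC with each of them.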
 The current implementation of the correlation optimizer only detects
 correlations among MR jobs for reduce-side join operators and reduce-side
 aggregation operators (not map-only aggregation). A query will be optimized
 if it satisfies the following conditions.
 # There exists an MR job for a reduce-side join operator or reduce-side
 aggregation operator which has JFC with all of its parent MR jobs (TCs will
 also be exploited if JFC exists);
 # All input tables of those correlated MR jobs are original input tables (not
 intermediate tables generated by sub-queries); and
 # No self join is involved in those correlated MR jobs.
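 As a rough sketch, the three conditions above could be checked as follows.
 The Job class, its fields, and is_candidate are hypothetical names used only
 for illustration. In particular, detecting a self join as "one job reads the
 same table twice" is an assumption, not necessarily how the patch does it.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical model of one reduce-side join or aggregation MR job."""
    partition_key: tuple
    input_tables: tuple = ()          # original tables read directly by this job
    reads_intermediate: bool = False  # True if an input is a sub-query's intermediate table
    parents: tuple = ()               # parent Job objects whose output this job consumes


def is_candidate(job):
    """Return True if `job` and its parents meet the three listed conditions."""
    if not job.parents:
        return False
    # 1. JFC with all parents: every parent shares the job's partition key.
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    # 2. The correlated parent jobs read only original input tables.
    if any(p.reads_intermediate for p in job.parents):
        return False
    # 3. No self join: no correlated job reads the same table twice (assumed test).
    for j in (job, *job.parents):
        if len(set(j.input_tables)) != len(j.input_tables):
            return False
    return True
```

 For example, a join whose two parent aggregations partition on the same key
 and each read one base table passes the check, while a parent with a
 different partition key or a duplicated input table does not.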
 The correlation optimizer is implemented as a logical optimizer. The main
 reasons are that it only needs to manipulate the query plan tree and that it
 can leverage the existing component for generating MR jobs.
 The current implementation can serve as a framework for correlation-related
 optimizations. I think that it is better than adding individual optimizers.
 Several pieces of work can be done in the future to improve this optimizer.
 Here are three examples.
 # Support queries that only involve TC;
 # Support queries in which the input tables of correlated MR jobs include
 intermediate tables; and
 # Optimize queries involving self joins.
 References:
 Paper and presentation of YSmart.
 Paper: 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
 Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-20 Thread alex gemini (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480878#comment-13480878
 ] 

alex gemini commented on HIVE-2206:
---

Does this JIRA have a short description? I know a join followed by a group by
is optimized like a pipeline; what else might we want to add to the wiki?

 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-20 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480881#comment-13480881
 ] 

Yin Huai commented on HIVE-2206:


[~gemini5201314]
I do not have a short version description right now. Let me write one and 
create a wiki page.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-01 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466937#comment-13466937
 ] 

He Yongqiang commented on HIVE-2206:


I will be on vacation this whole week. Given that this is a very big diff, I
will keep this open for another week or two for more comments.


 add a new optimizer for query correlation discovery and optimization
 

 Key: HIVE-2206
 URL: https://issues.apache.org/jira/browse/HIVE-2206
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
 Attachments: HIVE-2206.10-r1384442.patch.txt, 
 HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
 HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
 HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, 
 HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, 
 HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, 
 HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch


 reference:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-10-01 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467067#comment-13467067
 ] 

Yin Huai commented on HIVE-2206:


I just found that I can remove the first phase of this optimizer. Apparently
there were changes in the trunk, so I do not need to save the original
ColumnExprMap and OpParseCtx. I have removed the unnecessary code and am
running tests. Will update the patch later.



[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466552#comment-13466552
 ] 

He Yongqiang commented on HIVE-2206:


All tests passed for me.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Carl Steinbach (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466576#comment-13466576
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: I don't see a +1 vote in this JIRA. According to the project bylaws 
(https://cwiki.apache.org/confluence/display/Hive/Bylaws) this patch should not 
have been committed. Please back this patch out. Thanks.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466580#comment-13466580
 ] 

He Yongqiang commented on HIVE-2206:


I commented that all tests passed.

ok, +1.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

2012-09-30 Thread Carl Steinbach (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466582#comment-13466582
 ] 

Carl Steinbach commented on HIVE-2206:
--

@Yongqiang: Sorry, but that's not the way it works. You vote +1 first, wait 24 
hours, and then commit the patch. This is all covered in the project bylaws. 
Please revert this patch. Thanks.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466581#comment-13466581
 ] 

He Yongqiang commented on HIVE-2206:


@Carl, btw, I did mention a few times in the comments that I am planning to
commit this one.


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization


[ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466584#comment-13466584
 ] 

He Yongqiang commented on HIVE-2206:


I did not see a 24-hour waiting period on the bylaws page?


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization