[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2013-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548074#comment-13548074
 ] 

Hudson commented on HIVE-3086:
--

Integrated in Hive-trunk-hadoop2 #54 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/54/])
HIVE-3086. Skewed Join Optimization. njain via kevinwilfong (Revision 
1386996)

 Result = ABORTED
kevinwilfong : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1386996
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SkewJoinOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeColumnDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeConstantDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeFieldDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt1.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt10.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt11.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt12.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt13.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt14.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt15.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt16.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt17.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt18.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt19.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt2.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt20.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt3.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt4.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt5.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt6.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt7.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt8.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt9.q
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt15.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt16.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt17.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt18.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt19.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt20.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt9.q.out


 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Nadeem Moidu
Assignee: Namit Jain
 Fix For: 0.10.0

 Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, 
 hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch


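 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization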

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457754#comment-13457754
 ] 

Hudson commented on HIVE-3086:
--

Integrated in Hive-trunk-h0.21 #1679 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/1679/])
HIVE-3086. Skewed Join Optimization. njain via kevinwilfong (Revision 
1386996)

 Result = FAILURE
kevinwilfong : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1386996


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457885#comment-13457885
 ] 

Yin Huai commented on HIVE-3086:


A quick question: can you let me know where the join operator for skewed keys 
is converted to a map join operator? Thanks!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Nadeem Moidu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457891#comment-13457891
 ] 

Nadeem Moidu commented on HIVE-3086:


@Yin Huai: The join operator for skewed keys is automatically converted to a 
map join when the map join optimization is performed. That conversion is not 
included in this patch.



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457939#comment-13457939
 ] 

Yin Huai commented on HIVE-3086:


@Nadeem: Thanks! I have one more question. It seems that the large table 
(which has the skewed keys) will be scanned twice. Is my understanding correct?



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Nadeem Moidu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457947#comment-13457947
 ] 

Nadeem Moidu commented on HIVE-3086:


Yes, in the current implementation, both tables will be scanned twice. This 
can be avoided if the table scan operator is not replicated and instead has 
multiple children, but that optimization has not been done in this patch.
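
The difference between the two plans can be sketched in a few lines of Python. 
This is only an illustration, not Hive code: `Table`, `split_with_two_scans` and 
`split_with_one_scan` are hypothetical names, and the `is_skewed` predicate 
stands in for the filter on the skewed keys. The first function mimics a 
replicated table scan (one scan per branch); the second mimics a single scan 
operator with two children.

```python
class Table:
    """Hypothetical in-memory table that counts how often it is scanned."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def scan(self):
        # The counter is bumped when iteration starts, i.e. once per full scan.
        self.scans += 1
        yield from self._rows


def split_with_two_scans(table, is_skewed):
    """Replicated scan: each branch reads the whole table and filters it."""
    skewed = [row for row in table.scan() if is_skewed(row)]
    rest = [row for row in table.scan() if not is_skewed(row)]
    return skewed, rest


def split_with_one_scan(table, is_skewed):
    """Single scan with two children: each row is routed to one branch."""
    skewed, rest = [], []
    for row in table.scan():
        (skewed if is_skewed(row) else rest).append(row)
    return skewed, rest
```

Both functions return the same two row sets; only the number of passes over 
the table differs.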



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-18 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458005#comment-13458005
 ] 

Namit Jain commented on HIVE-3086:
--

[~yhuai], right now both the input and output tables will be scanned twice.



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-17 Thread Kevin Wilfong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457188#comment-13457188
 ] 

Kevin Wilfong commented on HIVE-3086:
-

+1 This looks good to me now.

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Nadeem Moidu
Assignee: Namit Jain
 Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, 
 hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch


 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-09 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451760#comment-13451760
 ] 

Namit Jain commented on HIVE-3086:
--

addressed Kevin's comments

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
Reporter: Nadeem Moidu
Assignee: Namit Jain
 Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, 
 hive.3086.4.patch


 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-09-07 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450442#comment-13450442
 ] 

Namit Jain commented on HIVE-3086:
--

addressed Nadeem's comments

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
Reporter: Nadeem Moidu
Assignee: Namit Jain
 Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch


 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-31 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445823#comment-13445823
 ] 

Namit Jain commented on HIVE-3086:
--

I have not run the existing tests yet; I have just started them.
I have verified the outputs of the new tests that were added as part of this 
patch.

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
Reporter: Nadeem Moidu
Assignee: Namit Jain
 Attachments: hive.3086.1.patch


 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-29 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444687#comment-13444687
 ] 

Namit Jain commented on HIVE-3086:
--

https://reviews.facebook.net/D5043

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
Reporter: Nadeem Moidu
Assignee: Namit Jain

 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-27 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442372#comment-13442372
 ] 

Carl Steinbach commented on HIVE-3086:
--

@Namit: Is this a work in progress?



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442512#comment-13442512
 ] 

Namit Jain commented on HIVE-3086:
--

yes



[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-05 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428864#comment-13428864
 ] 

Namit Jain commented on HIVE-3086:
--

@Alex, the problem that you mentioned can be handled by 
https://issues.apache.org/jira/browse/HIVE-3286.

Navis is working on that. These are independent strategies, and both can be 
applied.

 Skewed Join Optimization
 

 Key: HIVE-3086
 URL: https://issues.apache.org/jira/browse/HIVE-3086
 Project: Hive
  Issue Type: New Feature
Reporter: Nadeem Moidu
Assignee: Nadeem Moidu

 During a join operation, if one of the columns has a skewed key, it can cause 
 that particular reducer to become the bottleneck. The following feature will 
 address it:
 https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-04 Thread alex gemini (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428568#comment-13428568
 ] 

alex gemini commented on HIVE-3086:
---

@Yongqiang: We don't need a hint here; the above example is just for 
clarification. The main point is that if some key is skewed, we can combine 
it with another low-selectivity key such as the primary key, and use this 
composite key as the input for hash partitioning.
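
A minimal Python sketch of the composite-key idea, under stated assumptions: 
the row layout, the reducer count and the `partition` / `reducer_load` helpers 
are hypothetical, and `zlib.crc32` merely stands in for whatever hash function 
the partitioner would use. It only shows how the rows are distributed; for a 
join, the matching rows of the other table would still have to reach every 
reducer that holds part of the skewed key, which is not modelled here.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 6  # hypothetical reducer count


def partition(key_parts, num_reducers=NUM_REDUCERS):
    """Map a tuple of key columns to a reducer index.

    crc32 is used because it is stable across runs; Python's built-in
    hash() is randomized for strings.
    """
    return zlib.crc32(repr(key_parts).encode("utf-8")) % num_reducers


def reducer_load(rows, key_fn, num_reducers=NUM_REDUCERS):
    """Count how many rows each reducer would receive."""
    load = Counter()
    for row in rows:
        load[partition(key_fn(row), num_reducers)] += 1
    return load


# Hypothetical data: every row carries the same skewed value (age == 20),
# while `id` is unique per row.
rows = [{"id": i, "age": 20} for i in range(6000)]

by_skewed_key = reducer_load(rows, lambda r: (r["age"],))
by_composite_key = reducer_load(rows, lambda r: (r["age"], r["id"]))

print("skewed key only:", dict(by_skewed_key))
print("composite key:  ", dict(by_composite_key))
```

Partitioning on the skewed column alone sends every row to a single reducer, 
while the composite key spreads the same rows over several reducers.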





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-08-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427239#comment-13427239
 ] 

Namit Jain commented on HIVE-3086:
--

@Yongqiang, the current skew join does the optimization after most of the 
damage has already been done.
The reducer detects that a particular key is skewed, and then processes that 
key in a separate MR job.

However, in this approach, we plan to know the skewed keys beforehand (stored 
in the metastore), and then use them to do a map-join for the skewed keys and 
a normal join for the other keys. This does require some change from the user 
(the user needs to store the skewed keys in the metastore). However, this 
approach can be very good for repetitive workloads: similar queries running 
every day on similar data. Most probably, the skew does not change every day, 
so it can be calculated periodically.
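
The approach described above can be sketched in Python as follows. This is an 
illustration under assumptions, not the patch itself: `skew_aware_join` is a 
hypothetical name, tables are modelled as lists of (key, value) pairs, and the 
set of skewed keys is passed in directly rather than read from the metastore. 
Both branches are written as in-memory hash joins here; in Hive the second 
branch would be an ordinary reduce-side join.

```python
from collections import defaultdict


def skew_aware_join(left, right, skewed_keys):
    """Inner-join two lists of (key, value) pairs.

    Rows whose key is in `skewed_keys` are joined in a separate branch from
    the remaining rows; the two results are concatenated at the end.
    """
    skewed_keys = set(skewed_keys)

    # Branch 1: map-join style. Only the right-side rows for the skewed keys
    # are loaded into a hash table; left-side rows are streamed against it.
    skewed_right = defaultdict(list)
    for key, value in right:
        if key in skewed_keys:
            skewed_right[key].append(value)
    skewed_out = [
        (key, lv, rv)
        for key, lv in left
        if key in skewed_keys
        for rv in skewed_right.get(key, [])
    ]

    # Branch 2: normal join for all other keys.
    other_right = defaultdict(list)
    for key, value in right:
        if key not in skewed_keys:
            other_right[key].append(value)
    normal_out = [
        (key, lv, rv)
        for key, lv in left
        if key not in skewed_keys
        for rv in other_right.get(key, [])
    ]

    return skewed_out + normal_out
```

Note that this sketch reads each input once per branch, so it iterates over 
both tables twice.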





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-06-26 Thread alex gemini (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401262#comment-13401262
 ] 

alex gemini commented on HIVE-3086:
---

The design is very complicated, IMO. What if we have a big table logs and a 
small table users, where users has a column 'age', and we issue a query that 
is skewed by age, for which we can't pre-partition the big table? This design 
doesn't handle that, right? I guess what we want is custom partitioning at 
runtime. For the above example, we need a custom partitioner (or some hint) to 
tell the query plan that we want to partition the users table on the 
'userid,age' columns and the logs table on the 'userid' column. The partition 
number for the same userid needs to be the same in both tables for the 
subsequent join.





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-06-26 Thread Nadeem Moidu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401530#comment-13401530
 ] 

Nadeem Moidu commented on HIVE-3086:


@Alex, I'm sorry, but your question is not very clear. Can you please give the 
exact schema, query and skewed keys that you have in mind? Here are some 
comments based on what I understood from your question:
1. The bottleneck mentioned occurs only when the join key is skewed, so only 
that case is handled.
2. If a table is small, we have map-join to handle that.
3. We are not doing any pre-partitioning.





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-06-26 Thread alex gemini (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401939#comment-13401939
 ] 

alex gemini commented on HIVE-3086:
---

Consider a big table logs(userid, region, timestamps, url) with more than 10 
billion records and a medium-sized table users(userid, age) with 10 million 
records, and a query such as:
 select count(userid) from logs a, users b where a.userid = b.userid group by 
b.age
Let's say ages 18-25 account for more than 50% of the total records, ages 40-60 
for only 5%, and ages 25-50 for the rest.
What counts as skewed always depends on the query. In this case the skewed key 
is age, so we can't always assume the two tables are skewed on the join key, 
right?
Another example: select count(userid), to_date(timestamps,'MMDD'), age from 
logs where timestamps > 2011-12-01 and timestamps < 2011-12-31 and age < 25 and 
age > 18
Because of Christmas, 2011-12-25 to 2011-12-31 may have more records than the 
other days of that month (for the purpose of discussion, this query assumes 
that age is not skewed).
Since Hive uses hash partitioning, with, say, 6 reducers, 2011-12-24 and 
2011-12-30 can hash to the same reducer, which makes that one reducer process 
many more records than the others.
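
The effect described above can be sketched with a small simulation. This is 
only an illustration, not Hive's partitioner: zlib.crc32 stands in for Hive's 
own hash function, and the per-day row counts are made up.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 6  # the reducer count used in the comment above


def reducer_for(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Assign a key to a reducer by hashing it (stand-in for Hive's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(rows_per_key: dict, num_reducers: int = NUM_REDUCERS) -> Counter:
    """Sum the rows each reducer receives when every key goes to one reducer."""
    load = Counter()
    for key, rows in rows_per_key.items():
        load[reducer_for(key, num_reducers)] += rows
    return load


# Made-up row counts: 1,000 rows on ordinary days, 20,000 on each holiday day.
rows_per_day = {f"2011-12-{day:02d}": 1_000 for day in range(1, 32)}
for day in range(25, 32):
    rows_per_day[f"2011-12-{day:02d}"] = 20_000

load = reducer_load(rows_per_day)
for reducer in range(NUM_REDUCERS):
    print(reducer, load[reducer])
```

All rows of a given day go to a single reducer, so any reducer that receives 
one or more of the heavy days carries far more rows than a reducer that 
receives only ordinary days.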





[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-06-26 Thread alex gemini (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401945#comment-13401945
 ] 

alex gemini commented on HIVE-3086:
---

Maybe what we want is to change the partition key dynamically based on hints, 
for example:
select /**+ partitions logs(userid,timestamps),users(id) */ 
count(userid),to_date(timestamps,'MMDD'),age from logs where timestamps > 
2011-12-01 and timestamps < 2011-12-31 and age < 25 and age > 18.
This time we partition logs by userid and timestamps, so the records for 
2011-12-24 hash to six reducers instead of one, and each reducer processes 
roughly the same number of records.
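
A rough sketch of that idea, under stated assumptions: zlib.crc32 stands in 
for Hive's hash, the rows are synthetic, and the "partitions" hint itself is 
only a proposal in this comment, not existing Hive syntax. Note that once one 
day is spread over several reducers, each reducer holds only a partial count 
for that day, so a second step has to sum the partial counts.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 6


def reducer_for(*key_parts: str) -> int:
    """Hash all parts of the partition key together (stand-in for Hive's hash)."""
    joined = "\x00".join(key_parts)
    return zlib.crc32(joined.encode("utf-8")) % NUM_REDUCERS


# Hypothetical data: 60,000 log rows from distinct users, all on one hot day.
hot_day = "2011-12-24"
rows = [(f"user{n}", hot_day) for n in range(60_000)]

# Partitioning by date alone: every row of the hot day lands on one reducer.
by_date = Counter(reducer_for(day) for _, day in rows)

# Partitioning by (userid, date): the hot day is spread over the reducers.
by_user_and_date = Counter(reducer_for(user, day) for user, day in rows)

# Stage 1: each reducer counts only the rows it received for the hot day.
partial_counts = dict(by_user_and_date)
# Stage 2: the partial counts must be summed to get the final count per day.
total_for_day = sum(partial_counts.values())

print("by date:", dict(by_date))
print("by (userid, date):", partial_counts)
print("total for", hot_day, "=", total_for_day)
```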
Another query example:
select /**+ partitions logs(userid),users(id,age) */ 
count(userid),to_date(timestamps,'MMDD'),age from logs where timestamps > 
2011-01-01 and timestamps < 2011-12-31 and age < 25 and age > 18.
This time timestamps is not the primary skewed key, so we change our partition 
key to age.
In the ListBucketing design, 
create table T (c1 string, c2 string, c3 string) skewed by (c1, c2) on (('x1', 
'x2'), ('y1', 'y2'));
we need to assume that we know which columns a table is skewed by, but data is 
always skewed and we can't list every skewed value combination.






[jira] [Commented] (HIVE-3086) Skewed Join Optimization

2012-06-26 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401960#comment-13401960
 ] 

He Yongqiang commented on HIVE-3086:


User-supplied hints have proven not to be very useful. Automatically detecting 
skewed keys, as the current skew join processor does, will make this feature 
more powerful and useful.

@Nadeem, can you add more details to the wiki about the differences between the 
existing skew join optimization and the one you are working on? The current one 
cannot process the case where the same join key is skewed in more than one 
table. Are you targeting those cases? There are also some problems with the 
existing skew join optimization. Can you try to fix those as part of your 
project as well?
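
As a rough illustration of automatic detection (not the code of Hive's skew 
join processor), a key can be flagged as skewed once its row count exceeds a 
threshold, and the case mentioned above is then the set of keys flagged on 
both sides of the join. The function names, the threshold, and the sample keys 
below are all hypothetical.

```python
from collections import Counter


def find_skewed_keys(join_keys, threshold):
    """Return the set of keys that occur more than `threshold` times."""
    counts = Counter(join_keys)
    return {key for key, n in counts.items() if n > threshold}


def keys_skewed_in_both(left_keys, right_keys, threshold):
    """Return the keys that exceed the threshold on both sides of the join."""
    return find_skewed_keys(left_keys, threshold) & find_skewed_keys(right_keys, threshold)


# Hypothetical join-key columns of two tables.
left = ["k1"] * 5 + ["k2"] * 2 + ["k3"]
right = ["k1"] * 4 + ["k2"] + ["k3"] * 6

print(find_skewed_keys(left, threshold=3))
print(find_skewed_keys(right, threshold=3))
print(keys_skewed_in_both(left, right, threshold=3))
```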
