[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-30 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724477#comment-13724477
 ] 

Gunther Hagleitner commented on HIVE-4827:
--

This looks really good. Just a few smaller things:

- Can you add tests for cascading mapjoins? Something like 5 joins that will 
produce two groups (3 + 2).
- Can you add a test to show that setting the threshold to 0 effectively turns 
off the optimization?

Finally, I think it makes sense to remove the old flag since it no longer 
applies. But can you fill out the release notes section of this ticket and 
describe how 'optimize.mapjoin.mapreduce' is superseded by threshold = 0 or 
noconditionaltask?
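
A rough sketch of what the two requested tests might look like is below. It is
not taken from the patch: the tables big and small1..small5 are hypothetical,
the size value is arbitrary, and it assumes the "threshold" refers to
hive.auto.convert.join.noconditionaltask.size. Whether the five MapJoins
actually fall into groups of 3 + 2 depends on the small-table sizes relative
to that threshold.

{code:sql}
-- Hypothetical tables: big (large), small1..small5 (small enough for MapJoin).
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Assumed threshold: a size limit in bytes; 10000 is an arbitrary value.
set hive.auto.convert.join.noconditionaltask.size=10000;

-- Test 1: five joins on different keys, so that they are expected to be
-- planned as five separate MapJoins cascading one after another.
EXPLAIN
SELECT count(*)
FROM big b
JOIN small1 s1 ON (b.k1 = s1.k1)
JOIN small2 s2 ON (b.k2 = s2.k2)
JOIN small3 s3 ON (b.k3 = s3.k3)
JOIN small4 s4 ON (b.k4 = s4.k4)
JOIN small5 s5 ON (b.k5 = s5.k5);

-- Test 2: the same query with the threshold set to 0. The review comment
-- expects the plan to show that the merge no longer happens.
set hive.auto.convert.join.noconditionaltask.size=0;
EXPLAIN
SELECT count(*)
FROM big b
JOIN small1 s1 ON (b.k1 = s1.k1)
JOIN small2 s2 ON (b.k2 = s2.k2)
JOIN small3 s3 ON (b.k3 = s3.k3)
JOIN small4 s4 ON (b.k4 = s4.k4)
JOIN small5 s5 ON (b.k5 = s5.k5);
{code}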

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch, HIVE-4827.3.patch, 
 HIVE-4827.4.patch, HIVE-4827.5.patch, HIVE-4827.6.patch


 When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a 
 Map-only job (MapJoin) to the MapReduce job that follows it. But this merge 
 only happens when the MapReduce job has a single input. With the Correlation 
 Optimizer (HIVE-2206), the MapReduce job can have multiple inputs (for 
 multiple operation paths). CommonJoinResolver should be improved to merge 
 each Map-only job into the corresponding Map task of the MapReduce job.
 Example:
 {code:sql}
 set hive.optimize.correlation=true;
 set hive.auto.convert.join=true;
 set hive.optimize.mapjoin.mapreduce=true;
 SELECT tmp1.key, count(*)
 FROM (SELECT x1.key1 AS key
   FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
   GROUP BY x1.key1) tmp1
 JOIN (SELECT x2.key2 AS key
   FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
   GROUP BY x2.key2) tmp2
 ON (tmp1.key = tmp2.key)
 GROUP BY tmp1.key;
 {code}
 In this query, the join operations inside tmp1 and tmp2 will be converted to 
 two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and 
 tmp2, the join of tmp1 and tmp2, and the last aggregation will all be 
 executed in the same MapReduce job (Reduce side). Since this MapReduce job 
 has two inputs, CommonJoinResolver currently cannot attach the two MapJoins 
 to the Map side of that MapReduce job.
 Another example:
 {code:sql}
 SELECT tmp1.key
 FROM (SELECT x1.key2 AS key
   FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
   UNION ALL
   SELECT x2.key2 AS key
   FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1
 {code}
 In this case, we will have three Map-only jobs (two for the MapJoins and one 
 for the Union). It would be better to use a single Map-only job to execute 
 this query.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721062#comment-13721062
 ] 

Yin Huai commented on HIVE-4827:


An updated patch has been uploaded to RB. Marking it as PA (Patch Available) to 
trigger the precommit tests.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch, HIVE-4827.3.patch, 
 HIVE-4827.4.patch




[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-26 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721534#comment-13721534
 ] 

Hive QA commented on HIVE-4827:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12594491/HIVE-4827.5.patch

{color:green}SUCCESS:{color} +1 2653 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/202/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/202/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.CleanupPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch, HIVE-4827.3.patch, 
 HIVE-4827.4.patch, HIVE-4827.5.patch




[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720184#comment-13720184
 ] 

Yin Huai commented on HIVE-4827:


HIVE-4827.3.patch is not for review. I am currently running unit tests.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch, HIVE-4827.3.patch




[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719012#comment-13719012
 ] 

Yin Huai commented on HIVE-4827:


With this work, we can merge mergeMapJoinTaskWithChildMapJoinTask with 
mergeMapJoinTaskWithMapReduceTask and drop the 
'hive.optimize.mapjoin.mapreduce' flag.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch




[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719026#comment-13719026
 ] 

Yin Huai commented on HIVE-4827:


Also, with this work, different Map tasks (operation paths) can load different 
sets of hash tables. So, we should check the size of those hash tables per 
operation path.
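
To illustrate the point with the first example query from the description:
the sizes and the 10 MB limit below are made-up numbers, and
hive.auto.convert.join.noconditionaltask.size is only assumed to be the
relevant limit. Under those assumptions, a check on the combined size would
reject a merge that a per-path check would accept.

{code:sql}
-- Made-up hash table sizes, for illustration only:
--   smallTable1: 6 MB, loaded only by the Map path that scans bigTable1
--   smallTable2: 7 MB, loaded only by the Map path that scans bigTable2
-- Combined: 13 MB (over the assumed limit). Per path: 6 MB and 7 MB (both under it).
set hive.optimize.correlation=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
SELECT tmp1.key, count(*)
FROM (SELECT x1.key1 AS key
  FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
  GROUP BY x1.key1) tmp1
JOIN (SELECT x2.key2 AS key
  FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
  GROUP BY x2.key2) tmp2
ON (tmp1.key = tmp2.key)
GROUP BY tmp1.key;
{code}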



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715009#comment-13715009
 ] 

Hive QA commented on HIVE-4827:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12593451/HIVE-4827.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 2647 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketmapjoin6
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/123/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/123/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.CleanupPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests failed with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715411#comment-13715411
 ] 

Yin Huai commented on HIVE-4827:


I cannot reproduce the failed test case 
(TestMinimrCliDriver.testCliDriver_bucketmapjoin6) on my laptop. I will use 
another machine to test it.



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715424#comment-13715424
 ] 

Brock Noland commented on HIVE-4827:


Yeah that test is flaky. I just submitted another run of the precommit.



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715438#comment-13715438
 ] 

Yin Huai commented on HIVE-4827:


Thanks!



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715968#comment-13715968
 ] 

Yin Huai commented on HIVE-4827:


Found an issue related to partitioned big tables. Will fix the issue later.



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-22 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715985#comment-13715985
 ] 

Gunther Hagleitner commented on HIVE-4827:
--

Comments on RB. Nice optimization.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch





[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-21 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714736#comment-13714736
 ] 

Edward Capriolo commented on HIVE-4827:
---

Because two MapReduce jobs are now becoming a single one, does this mean that 
there is a greater chance of the map task failing with OOM conditions?

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch





[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714888#comment-13714888
 ] 

Yin Huai commented on HIVE-4827:


Review board: https://reviews.apache.org/r/12795/

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch





[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714648#comment-13714648
 ] 

Yin Huai commented on HIVE-4827:


I am running tests. I will see whether any results need to be updated or 
whether there is a bug in my patch.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4827.1.patch





[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714292#comment-13714292
 ] 

Yin Huai commented on HIVE-4827:


Actually, I only need to change 
CommonJoinTaskDispatcher.mergeMapJoinTaskWithMapReduceTask. There is no need to 
change MapJoinOperator. I just made my patch work for the simple union query 
shown in the description, and I should be able to upload a patch soon.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai

 When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a 
 Map-only job (MapJoin) to the MapReduce job that follows it. However, this 
 merge only happens when the MapReduce job has a single input. With the 
 Correlation Optimizer (HIVE-2206), the MapReduce job can have multiple inputs 
 (for multiple operation paths). CommonJoinResolver should be improved to merge 
 each Map-only job into the corresponding Map task of the MapReduce job.
 Example:
 {code:sql}
 set hive.optimize.correlation=true;
 set hive.auto.convert.join=true;
 set hive.optimize.mapjoin.mapreduce=true;
 SELECT tmp1.key, count(*)
 FROM (SELECT x1.key2 AS key
   FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
   GROUP BY x1.key2) tmp1
 JOIN (SELECT x2.key2 AS key
   FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)
   GROUP BY x2.key2) tmp2
 ON (tmp1.key = tmp2.key)
 GROUP BY tmp1.key;
 {code}
 In this query, the join operations inside tmp1 and tmp2 will be converted to 
 two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and 
 tmp2, the join of tmp1 and tmp2, and the last aggregation will all be executed 
 on the Reduce side of the same MapReduce job. Because this MapReduce job has 
 two inputs, CommonJoinResolver currently cannot attach the two MapJoins to its 
 Map side.
 Another example:
 {code:sql}
 SELECT tmp1.key
 FROM (SELECT x1.key2 AS key
   FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
   UNION ALL
   SELECT x2.key2 AS key
   FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1
 {code}
 For this case, we will have three Map-only jobs (two for the MapJoins and one 
 for the Union). It would be better to execute this query with a single 
 Map-only job.



[jira] [Commented] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

2013-07-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704063#comment-13704063
 ] 

Yin Huai commented on HIVE-4827:


I have been studying the code related to MapJoinOperator and 
CommonJoinResolver. Besides CommonJoinResolver, it seems that we also need to 
change MapJoinOperator so that it loads the correct hash tables, because some 
hash tables may not be needed by a given MapJoinOperator.

 Merge a Map-only job to its following MapReduce job with multiple inputs
 

 Key: HIVE-4827
 URL: https://issues.apache.org/jira/browse/HIVE-4827
 Project: Hive
  Issue Type: Improvement
Reporter: Yin Huai
Assignee: Yin Huai

 When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a 
 Map-only job (MapJoin) to the MapReduce job that follows it. However, this 
 merge only happens when the MapReduce job has a single input. With the 
 Correlation Optimizer (HIVE-2206), the MapReduce job can have multiple inputs 
 (for multiple operation paths). CommonJoinResolver should be improved to merge 
 each Map-only job into the corresponding Map task of the MapReduce job.
 Example:
 {code:sql}
 SELECT tmp1.key, count(*)
 FROM (SELECT x1.key2 AS key
   FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
   GROUP BY x1.key2) tmp1
 JOIN (SELECT x2.key2 AS key
   FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)
   GROUP BY x2.key2) tmp2
 ON (tmp1.key = tmp2.key)
 GROUP BY tmp1.key;
 {code}
 In this query, the join operations inside tmp1 and tmp2 will be converted to 
 two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and 
 tmp2, the join of tmp1 and tmp2, and the last aggregation will all be executed 
 on the Reduce side of the same MapReduce job. Because this MapReduce job has 
 two inputs, CommonJoinResolver currently cannot attach the two MapJoins to its 
 Map side.
