[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13873656#comment-13873656 ]

Yin Huai commented on HIVE-5945:

Committed to trunk. Thanks, Navis!

ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Key: HIVE-5945
URL: https://issues.apache.org/jira/browse/HIVE-5945
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
Fix For: 0.13.0
Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt

Here is an example:
{code}
select
  i_item_id,
  s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'F' and
  cd_marital_status = 'U' and
  cd_education_status = 'Primary' and
  d_year = 2002 and
  s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 map-only jobs for this query. However, I got 1 map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins). I then checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
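The over-counting described in the issue can be illustrated with a small sketch. This is not Hive's code: the class `SmallTableSumSketch`, both method names, the use of `$INTNAME` as the alias of the intermediate table, and every size except the roughly 45GB reported for store_sales are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the reported problem; not Hive source code.
class SmallTableSumSketch {

  // Faulty variant: sums every known alias except the big table,
  // including aliases that the child task never reads.
  static long sumAllExcept(Map<String, Long> aliasToSize, String bigAlias) {
    long total = 0L;
    for (Map.Entry<String, Long> e : aliasToSize.entrySet()) {
      if (!e.getKey().equals(bigAlias)) {
        total += e.getValue();
      }
    }
    return total;
  }

  // Restricted variant: only aliases used by the child task are counted.
  static long sumParticipantsExcept(Map<String, Long> aliasToSize,
      Set<String> participants, String bigAlias) {
    long total = 0L;
    for (String alias : participants) {
      if (!alias.equals(bigAlias)) {
        total += aliasToSize.getOrDefault(alias, 0L);
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> aliasToSize = new LinkedHashMap<>();
    aliasToSize.put("store_sales", 45_000_000_000L); // about 45GB, as reported
    aliasToSize.put("$INTNAME", 20_000_000L);        // assumed intermediate size
    aliasToSize.put("item", 5_000_000L);             // assumed size

    // The join involving item reads only the intermediate table and item.
    Set<String> participants = Set.of("$INTNAME", "item");

    System.out.println(sumAllExcept(aliasToSize, "$INTNAME"));
    System.out.println(sumParticipantsExcept(aliasToSize, participants, "$INTNAME"));
  }
}
```

With the faulty variant, store_sales dominates the "small table" total, so any realistic threshold is exceeded and the map join is rejected. The restricted variant counts only item.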
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872582#comment-13872582 ]

Yin Huai commented on HIVE-5945:

+1
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869351#comment-13869351 ]

Hive QA commented on HIVE-5945:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12622571/HIVE-5945.7.patch.txt

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 4917 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_filters
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union22
org.apache.hadoop.hive.ql.plan.TestConditionalResolverCommonJoin.testResolvingDriverAlias
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/877/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/877/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12622571
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869579#comment-13869579 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12622597/HIVE-5945.8.patch.txt

{color:green}SUCCESS:{color} +1 4917 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/882/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/882/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12622597
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863047#comment-13863047 ]

Yin Huai commented on HIVE-5945:

Thanks Navis for the change. date_dim is a native table. Actually, I think the problem is org.apache.hadoop.hive.ql.plan.ConditionalResolverCommonJoin.getParticipants. It uses ctx.getAliasToTask(); to get all aliases. However, these aliases do not include aliases appearing in the MapLocalWork (those small tables). So for a query like
{code}
set hive.auto.convert.join.noconditionaltask=false;
select i_item_id
FROM store_sales
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
limit 10;
{code}
The plan is
{code}
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-6, Stage-1
  Stage-6 has a backup stage: Stage-1
  Stage-3 depends on stages: Stage-6
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
    Conditional Operator

  Stage: Stage-6
    Map Reduce Local Work
      Alias -> Map Local Tables:
        item
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        item
          TableScan
            alias: item
            HashTable Sink Operator
              condition expressions:
                0
                1 {i_item_id}
              handleSkewJoin: false
              keys:
                0 [Column[ss_item_sk]]
                1 [Column[i_item_sk]]
              Position of Big Table: 0

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        store_sales
          TableScan
            alias: store_sales
            Map Join Operator
              condition map:
                Inner Join 0 to 1
              condition expressions:
                0
                1 {i_item_id}
              handleSkewJoin: false
              keys:
                0 [Column[ss_item_sk]]
                1 [Column[i_item_sk]]
              outputColumnNames: _col26
              Position of Big Table: 0
              Select Operator
                expressions:
                  expr: _col26
                  type: string
                outputColumnNames: _col0
                Limit
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        item
          TableScan
            alias: item
            Reduce Output Operator
              key expressions:
                expr: i_item_sk
                type: int
              sort order: +
              Map-reduce partition columns:
                expr: i_item_sk
                type: int
              tag: 1
              value expressions:
                expr: i_item_id
                type: string
        store_sales
          TableScan
            alias: store_sales
            Reduce Output Operator
              key expressions:
                expr: ss_item_sk
                type: int
              sort order: +
              Map-reduce partition columns:
                expr: ss_item_sk
                type: int
              tag: 0
      Reduce Operator Tree:
        Join Operator
          condition map:
            Inner Join 0 to 1
          condition expressions:
            0
            1 {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col26
          Select Operator
            expressions:
              expr: _col26
              type: string
            outputColumnNames: _col0
            Limit
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
{code}
The alias of item will not be in the set returned by getParticipants. Thus, the input of sumOfExcept will be
{code}
aliasToSize: {store_sales=388445409, item=5051899}
aliases: [store_sales]
except: store_sales
{code}
and then we get 0 for the size of small tables. I think in getParticipants, we can check the type of a task and if it is a MapRedTask, we can use getWork().getMapWork().getMapLocalWork() to get the local task. Then, we can get aliases of those small tables through aliasToWork.

Another minor comment. Can you add a comment
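The effect of the missing alias can be reproduced with the values quoted above. The following is a simplified stand-in, not the Hive method: the signature of `sumOfExcept` and the helper `withLocalAliases` are assumptions for illustration, and the real change would obtain the local aliases from the task's local work rather than from a hard-coded set.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and signatures are not taken from Hive.
class ParticipantsSketch {

  // Sums the known sizes of the given aliases, skipping the alias `except`.
  static long sumOfExcept(Map<String, Long> aliasToSize,
      Set<String> aliases, String except) {
    long total = 0L;
    for (String alias : aliases) {
      if (alias.equals(except)) {
        continue;
      }
      Long size = aliasToSize.get(alias);
      if (size != null) {
        total += size;
      }
    }
    return total;
  }

  // Stand-in for the proposed fix: also include aliases that only
  // appear in the map-side local work (the small tables).
  static Set<String> withLocalAliases(Set<String> taskAliases,
      Set<String> localWorkAliases) {
    Set<String> all = new LinkedHashSet<>(taskAliases);
    all.addAll(localWorkAliases);
    return all;
  }

  public static void main(String[] args) {
    Map<String, Long> aliasToSize = new HashMap<>();
    aliasToSize.put("store_sales", 388445409L);
    aliasToSize.put("item", 5051899L);

    // Participants as currently computed: item is missing.
    Set<String> current = Set.of("store_sales");
    System.out.println(sumOfExcept(aliasToSize, current, "store_sales"));

    // Participants once the local-work alias is included.
    Set<String> proposed = withLocalAliases(current, Set.of("item"));
    System.out.println(sumOfExcept(aliasToSize, proposed, "store_sales"));
  }
}
```

The first call returns 0, matching the observation above. The second returns 5051899, the size of item.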
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861465#comment-13861465 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12621023/HIVE-5945.6.patch.txt

{color:green}SUCCESS:{color} +1 4873 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/790/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/790/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12621023
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13859982#comment-13859982 ]

Navis commented on HIVE-5945:

[~yhuai] Ah.. right. I'll check that (and zero size issue also). Thanks for the enlightening.
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13859114#comment-13859114 ]

Yin Huai commented on HIVE-5945:

Thanks Navis :) I played with your patch and found an issue, which I commented on at the review board. I am also attaching more info here. For the query in the description, we can have 4 map-joins. There will be 3 different intermediate tables called $INTNAME. The current patch does not update the size of $INTNAME. Here are the logs.
{code}
13/12/30 16:48:25 INFO ql.Driver: MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 12.76 sec HDFS Read: 388445624 HDFS Write: 20815654 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 0: Map: 1 Cumulative CPU: 12.76 sec HDFS Read: 388445624 HDFS Write: 20815654 SUCCESS
Job 1: Map: 1 Cumulative CPU: 9.18 sec HDFS Read: 20816111 HDFS Write: 28593993 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 1: Map: 1 Cumulative CPU: 9.18 sec HDFS Read: 20816111 HDFS Write: 28593993 SUCCESS
Job 2: Map: 1 Cumulative CPU: 17.38 sec HDFS Read: 80660331 HDFS Write: 378063 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 2: Map: 1 Cumulative CPU: 17.38 sec HDFS Read: 80660331 HDFS Write: 378063 SUCCESS
Job 3: Map: 1 Cumulative CPU: 2.06 sec HDFS Read: 378520 HDFS Write: 96 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 3: Map: 1 Cumulative CPU: 2.06 sec HDFS Read: 378520 HDFS Write: 96 SUCCESS
Job 4: Map: 1 Reduce: 1 Cumulative CPU: 2.45 sec HDFS Read: 553 HDFS Write: 96 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 4: Map: 1 Reduce: 1 Cumulative CPU: 2.45 sec HDFS Read: 553 HDFS Write: 96 SUCCESS
Job 5: Map: 1 Reduce: 1 Cumulative CPU: 2.33 sec HDFS Read: 553 HDFS Write: 0 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 5: Map: 1 Reduce: 1 Cumulative CPU: 2.33 sec HDFS Read: 553 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 46 seconds 160 msec
{code}
{code}
Map-join1:
plan.ConditionalResolverCommonJoin: Driver alias is store_sales with size 388445409 (total size of others : 0, threshold : 2500)
Stage-28 is selected by condition resolver.

Map-join2:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 (total size of others : 5051899, threshold : 2500)
Stage-26 is selected by condition resolver.

Map-join3:
plan.ConditionalResolverCommonJoin: Driver alias is customer_demographics with size 80660096 (total size of others : 20815654, threshold : 2500)
Stage-24 is filtered out by condition resolver.

Map-join4:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 (total size of others : 3155, threshold : 2500)
Stage-22 is selected by condition resolver.
{code}
By the way, a minor question: why does the log of map-join 1 show the size of others as 0?
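The stale $INTNAME size can be pictured with a small sketch. Because every intermediate table shares the alias $INTNAME, a size map keyed by alias keeps reporting the first value unless it is refreshed after each job. The class `IntermediateSizeSketch` and its methods are hypothetical, and treating the HDFS Write values of Job 0 and Job 1 as the sizes of the first two intermediates is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of Hive.
class IntermediateSizeSketch {
  static final String INTERMEDIATE = "$INTNAME";

  private final Map<String, Long> aliasToSize = new HashMap<>();

  // Hypothetical hook: record the size of the intermediate table that
  // the job which just finished has written, replacing any older value.
  void onIntermediateWritten(long bytesWritten) {
    aliasToSize.put(INTERMEDIATE, bytesWritten);
  }

  // Returns the recorded size, or -1 if the alias is unknown.
  long sizeOf(String alias) {
    return aliasToSize.getOrDefault(alias, -1L);
  }

  public static void main(String[] args) {
    IntermediateSizeSketch sizes = new IntermediateSizeSketch();

    sizes.onIntermediateWritten(20815654L); // HDFS Write of Job 0
    System.out.println(sizes.sizeOf(INTERMEDIATE));

    // Without this second update, later joins would still see 20815654.
    sizes.onIntermediateWritten(28593993L); // HDFS Write of Job 1
    System.out.println(sizes.sizeOf(INTERMEDIATE));
  }
}
```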
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13858568#comment-13858568 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620793/HIVE-5945.5.patch.txt

{color:green}SUCCESS:{color} +1 4818 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/767/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/767/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12620793
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13857297#comment-13857297 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620580/HIVE-5945.4.patch.txt

{color:green}SUCCESS:{color} +1 4817 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/747/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/747/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12620580
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851756#comment-13851756 ]

Yin Huai commented on HIVE-5945:

I left two minor comments on the review board, plus two additional ones here. When we find
{code}
bigTableFileAlias != null
{code}
can we also log sumOfOthers and the threshold on the size of small tables? The log entry would then show the size of the big table, the total size of the other small tables, and the small-table size threshold. Also, can you add a unit test? Thanks :)

ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Key: HIVE-5945
URL: https://issues.apache.org/jira/browse/HIVE-5945
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Navis
Priority: Critical
Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, HIVE-5945.3.patch.txt

Here is an example
{code}
select
  i_item_id,
  s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'F' and
  cd_marital_status = 'U' and
  cd_education_status = 'Primary' and
  d_year = 2002 and
  s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query and the intermediate table generated by joining store_sales and date_dim. So, when we sum the size of all small tables, the size of store_sales (which is around 45GB in my test) is also counted.

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
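The miscount described in this report can be illustrated with a small, self-contained sketch. It is not Hive's code: the class MapJoinSizeSketch, the method pickBigTable, the alias name "intermediate", the 25MB threshold, and every size except the 45GB of store_sales are assumptions made for illustration. The point is only that the candidate check should sum the sizes of the aliases that actually feed the join under consideration, rather than every alias in the query.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class MapJoinSizeSketch {

    // Hypothetical helper, not Hive's API. Returns the alias to stream as the
    // big table, or null when no choice keeps the remaining tables under the
    // small-table threshold. Only aliases in `participants` are summed.
    static String pickBigTable(Map<String, Long> aliasToSize,
                               Set<String> participants,
                               long smallTableThreshold) {
        long total = 0;
        for (String alias : participants) {
            Long size = aliasToSize.get(alias);
            if (size == null) {
                return null; // unknown size: do not risk a map join
            }
            total += size;
        }
        String best = null;
        long bestSize = -1;
        for (String alias : participants) {
            long size = aliasToSize.get(alias);
            long sumOfOthers = total - size; // everything that would be hashed
            if (sumOfOthers <= smallTableThreshold && size > bestSize) {
                best = alias;
                bestSize = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        Map<String, Long> aliasToSize = new LinkedHashMap<>();
        aliasToSize.put("store_sales", 45L * 1024L * mb); // ~45GB, as in the report
        aliasToSize.put("intermediate", 1024L * mb);      // assumed size
        aliasToSize.put("item", 10L * mb);                // assumed size
        long threshold = 25L * mb;                        // assumed threshold

        // Summing every alias in the query: store_sales is always counted
        // among the "others", so no candidate qualifies.
        System.out.println(pickBigTable(aliasToSize, aliasToSize.keySet(), threshold));

        // Summing only the inputs of the join involving item.
        Set<String> joinInputs = new LinkedHashSet<>();
        joinInputs.add("intermediate");
        joinInputs.add("item");
        System.out.println(pickBigTable(aliasToSize, joinInputs, threshold));
    }
}
```

With these assumed sizes the first call returns null (no map join) and the second returns "intermediate", which mirrors the difference between the observed reduce joins and the expected Map-only jobs.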
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851260#comment-13851260 ]

Yin Huai commented on HIVE-5945:

Thanks [~navis] :) I left a few comments on the review board.

I think the conditional task in the original trunk is not well tested. A .q test file cannot tell us whether a conditional task picks the right execution plan, because the output of a .q file only shows the plan and the query result. I think it is necessary to add a JUnit test that checks the decision made by resolveMapJoinTask.

Also, let's add some logging in resolveMapJoinTask. Right now, ConditionalTask only logs "xx is filtered out by condition resolver." and "xx is selected by condition resolver." From these two messages, we cannot tell why an execution plan was selected. In resolveMapJoinTask, we can first log the sizes of the tables that will be used in the next task and then log why a path was selected.
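As a rough illustration of the logging requested in this thread, the sketch below builds a message that carries the big table's size, the total size of the other tables (sumOfOthers), and the small-table threshold. The class ResolverLogSketch, the method describeDecision, and the message wording are all hypothetical; the messages Hive actually emits may look different.

```java
class ResolverLogSketch {

    // Hypothetical formatter, not part of Hive. A null alias means that no
    // candidate big table kept the other tables under the threshold.
    static String describeDecision(String bigTableAlias, long bigTableSize,
                                   long sumOfOthers, long threshold) {
        if (bigTableAlias == null) {
            return String.format(
                "No map join selected: no candidate keeps the small tables within threshold=%d bytes",
                threshold);
        }
        return String.format(
            "Map join selected: bigTable=%s (size=%d bytes), sumOfOthers=%d bytes, threshold=%d bytes",
            bigTableAlias, bigTableSize, sumOfOthers, threshold);
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        System.out.println(describeDecision("intermediate", 1073741824L, 10485760L, 26214400L));
        System.out.println(describeDecision(null, 0L, 0L, 26214400L));
    }
}
```

A pure function like this is also easy to cover with a JUnit test, since the decision and its explanation can be checked without running a query.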
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851459#comment-13851459 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12619255/HIVE-5945.3.patch.txt

{color:green}SUCCESS:{color} +1 4762 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/682/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/682/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12619255
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845262#comment-13845262 ]

Hive QA commented on HIVE-5945:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12618185/HIVE-5945.2.patch.txt

{color:green}SUCCESS:{color} +1 4761 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/611/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/611/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12618185
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844912#comment-13844912 ]

Navis commented on HIVE-5945:

Running tests. I've heard that MapJoin stopped working after upgrading to 0.11.0. HIVE-4042 (ignoring mapjoin hint), which is included in 0.11.0, seems to have revealed this issue.
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844968#comment-13844968 ]

Yin Huai commented on HIVE-5945:

Thanks [~navis] for taking this issue. Can you attach the link to the review board? Also, I saw
{code}
+// todo: should nullify summary for non-native tables,
+// not to be selected as a mapjoin target
{code}
in your patch. Does a non-native table mean an intermediate table? If so, I think that for a conditional task it is better to keep the option of using the intermediate table as the small table.
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844980#comment-13844980 ]

Navis commented on HIVE-5945:

A non-native table is a table based on a custom storage handler, such as HBaseStorageHandler. In this case, the input summary for the table directory always reports 0 files and 0 length, which might mislead the mapjoin resolver into treating the table as small enough to be hashed. I'll make a review board entry.
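The concern about non-native tables can be sketched as follows. This is only an illustration of the idea in the patch's todo comment ("nullify summary for non-native tables"), not the patch itself: the class NonNativeSizeSketch, the method usableSizes, and the alias names are hypothetical. The reported size of a non-native alias is replaced with null, meaning "unknown", so that a reported length of 0 is never mistaken for a tiny table.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NonNativeSizeSketch {

    // Hypothetical helper, not Hive's API. Copies the reported sizes but maps
    // every non-native alias to null so a later size check can skip it.
    static Map<String, Long> usableSizes(Map<String, Long> reportedSizes,
                                         Set<String> nonNativeAliases) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : reportedSizes.entrySet()) {
            boolean nonNative = nonNativeAliases.contains(entry.getKey());
            result.put(entry.getKey(), nonNative ? null : entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> reported = new HashMap<>();
        reported.put("hbase_table", 0L);   // storage-handler table: summary reports 0 length
        reported.put("item", 10485760L);   // assumed size of a native table

        Set<String> nonNative = new HashSet<>();
        nonNative.add("hbase_table");

        Map<String, Long> usable = usableSizes(reported, nonNative);
        System.out.println("hbase_table -> " + usable.get("hbase_table"));
        System.out.println("item -> " + usable.get("item"));
    }
}
```

A resolver that treats a null size as "do not hash this table" would then leave non-native tables out of the small-table candidates, while intermediate tables with real file sizes remain eligible.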
[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
[ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845045#comment-13845045 ]

Hive QA commented on HIVE-5945:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12618153/HIVE-5945.1.patch.txt

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 4761 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_infer_bucket_sort_convert_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_hook
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_scan_params
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/605/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/605/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12618153