Adesh Kumar Rao created HIVE-16499:
--------------------------------------

             Summary: [Tez] CommonMergeJoin Operator is taking longer to join rows as compared to MR
                 Key: HIVE-16499
                 URL: https://issues.apache.org/jira/browse/HIVE-16499
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Adesh Kumar Rao

The issue can be reproduced with a reduce-side join. (Apply the patch available in HIVE-16498 first; otherwise the time spent reading useless data masks the slow join described here.)

The data for large_table is generated by the following shell script, and a table can be created from the resulting file `large.txt`:

{code:java}
for (( j=1; j<=20; j++ ))
do
  for (( i=1; i<=1000000; i++ ))
  do
    echo "$i,$j" >> large.txt
  done
done
{code}

{code:java}
create external table large_table (i int, j int)
row format delimited fields terminated by ','
location "hdfs://<some-hdfs-location>";

set hive.auto.convert.join=false; -- So that reduce side join is used instead of MapJoin

select * from large_table a join large_table b on a.j = b.j limit 100;
{code}

This issue is different from HIVE-16498: here Tez spends the time in the join operator rather than in reading extra data. With the patch for HIVE-16498 applied, the above join query takes around 30-40 minutes on Tez, compared to 5 minutes on MR.
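
For a sense of scale, here is a rough back-of-the-envelope sketch (not part of the original report) of the sizes implied by the generator loops above. The variable names are illustrative only. It counts the rows in large_table and the row pairs an unlimited join on j would match. It makes no claim about where Tez actually spends its time.

```shell
# Sizes read off the generator loops in the report.
keys=20               # distinct values of j (outer loop)
rows_per_key=1000000  # rows sharing each value of j (inner loop)

# Every row with a given j matches every other row with the same j,
# so each key contributes rows_per_key * rows_per_key pairs.
total_rows=$(( keys * rows_per_key ))
pairs_per_key=$(( rows_per_key * rows_per_key ))
total_pairs=$(( keys * pairs_per_key ))

echo "rows in large_table:         $total_rows"
echo "matching pairs per j value:  $pairs_per_key"
echo "pairs in the unlimited join: $total_pairs"
```

The query only asks for 100 rows (`limit 100`), so the full result is never materialised. The figures simply show how skewed the join key is: only 20 distinct values across 20 million rows.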