[ https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-27267:
----------------------------------
Labels: pull-request-available (was: )
> Incorrect results when doing bucket map join on decimal bucketed column with subquery
> --------------------------------------------------------------------------------------
>
> Key: HIVE-27267
> URL: https://issues.apache.org/jira/browse/HIVE-27267
> Project: Hive
> Issue Type: Bug
> Reporter: Sourabh Badhya
> Assignee: Seonggon Namgung
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The following queries, when run on a Hive cluster, produce no results -
> Repro queries -
> {code:java}
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.support.concurrency=true;
> set hive.convert.join.bucket.mapjoin.tez=true;
>
> drop table if exists test_external_source;
> create external table test_external_source
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_source values
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2022-09-01', 'pipeline', '50000000000000000006008686831'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992621067'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992621067');
>
> drop table if exists test_external_target;
> create external table test_external_target
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_target values
>   ('2017-05-17', 'pipeline', '50000000000000000000441610525'),
>   ('2018-12-20', 'pipeline', '50000000000000000001048981030'),
>   ('2020-06-30', 'pipeline', '50000000000000000002332575516'),
>   ('2021-08-16', 'pipeline', '50000000000000000003897973989'),
>   ('2017-06-06', 'pipeline', '50000000000000000000449148729'),
>   ('2017-09-08', 'pipeline', '50000000000000000000525378314'),
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2018-05-03', 'pipeline', '50000000000000000000750826355'),
>   ('2020-01-10', 'pipeline', '50000000000000000001816579677'),
>   ('2021-11-01', 'pipeline', '50000000000000000004269423714'),
>   ('2017-11-07', 'pipeline', '50000000000000000000585901787'),
>   ('2019-10-15', 'pipeline', '50000000000000000001598843430'),
>   ('2020-04-01', 'pipeline', '50000000000000000002035795461'),
>   ('2020-02-24', 'pipeline', '50000000000000000001932600185'),
>   ('2020-04-27', 'pipeline', '50000000000000000002108160849'),
>   ('2016-07-05', 'pipeline', '50000000000000000000054405114'),
>   ('2020-06-02', 'pipeline', '50000000000000000002234387967'),
>   ('2020-08-21', 'pipeline', '50000000000000000002529168758'),
>   ('2021-02-17', 'pipeline', '50000000000000000003158511687');
>
> drop table if exists target_table;
> drop table if exists source_table;
> create table target_table
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets
>   stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true',
>                  'transactional_properties'='default');
> create table source_table
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets
>   stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true',
>                  'transactional_properties'='default');
>
> insert into table target_table select * from test_external_target;
> insert into table source_table select * from test_external_source;
> {code}
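> As a sanity check (not part of the original report, but useful when
> reproducing), one can confirm that both bucketed tables actually contain rows
> for the join key before running the failing query; given the data above, each
> of these should return 2:
> {code:java}
> -- Count the rows carrying the key under investigation in each bucketed table.
> select count(*) from source_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> select count(*) from target_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> {code}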
> The query under investigation -
> {code:java}
> select *
> from target_table
> inner join (select distinct date_col, 'pipeline' string_col, decimal_col
>             from source_table
>             where coalesce(decimal_col,'') = '50000000000000000005905545593') s
> on s.date_col = target_table.date_col
>    AND s.string_col = target_table.string_col
>    AND s.decimal_col = target_table.decimal_col;
> {code}
> Expected result of the query - 2 records
> {code:java}
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> | target_table.date_col  | target_table.string_col  | target_table.decimal_col       | s.date_col  | s.string_col  | s.decimal_col                  |
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> | 2022-08-16             | pipeline                 | 50000000000000000005905545593  | 2022-08-16  | pipeline      | 50000000000000000005905545593  |
> | 2022-08-30             | pipeline                 | 50000000000000000005905545593  | 2022-08-30  | pipeline      | 50000000000000000005905545593  |
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> {code}
> Actual result of the query - No records
> {code:java}
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> | target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> {code}
> The workaround that produces the correct result here is to set the following
> config to false -
> {code:java}
> set hive.convert.join.bucket.mapjoin.tez=false;
> {code}
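> One way to verify which plan is in effect (a sketch, not from the original
> report) is to toggle the config and compare the explain output; with the
> conversion enabled, the Map Join operator in the plan is marked as a bucket
> map join (the exact wording of the explain output varies by Hive version):
> {code:java}
> set hive.convert.join.bucket.mapjoin.tez=true;
> explain
> select *
> from target_table
> inner join (select distinct date_col, 'pipeline' string_col, decimal_col
>             from source_table
>             where coalesce(decimal_col,'') = '50000000000000000005905545593') s
> on s.date_col = target_table.date_col
>    AND s.string_col = target_table.string_col
>    AND s.decimal_col = target_table.decimal_col;
> {code}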
> Notes from investigation -
> 1. The batch containing the 2 expected rows is forwarded correctly to the map
> join operator. However, during the join comparison, the hash table is empty.
> 2. The problem seems to be that even though HashTableDummyOperator loads the
> records into a hash table, the map join operator does not take into account
> the hash tables from all instances of HashTableDummyOperator (there are
> multiple instances because the bucket map join initiates multiple map tasks);
> instead it uses the hash table from only one of the HashTableDummyOperator
> instances. In this case, the selected instance had an empty hash table, hence
> no records were matched in the join operator. (A sketch for checking which
> bucket file holds the matching rows follows these notes.)
> 3. If the table is unbucketed or has a single bucket, the results are
> correct: only 1 map task is spawned, and it loads all the records into the
> hash table. The workaround (setting *hive.convert.join.bucket.mapjoin.tez* to
> {*}false{*}) has the same effect, since a single map task then loads all the
> records into the hash table.
> 4. HashTableDummyOperator is created in the optimizer and is associated with
> the plan, hence we suspect an issue in the optimizer code. Ideally, the map
> join operator must use the hash tables from all instances of
> HashTableDummyOperator.
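> To relate the notes above to the physical layout, the virtual column
> INPUT__FILE__NAME can show which bucket file (and hence which map task) the
> matching rows belong to (a sketch; the bucket_N file naming assumes the usual
> ACID table layout):
> {code:java}
> -- Lists the bucket files that hold the rows for the key under investigation.
> select distinct INPUT__FILE__NAME
> from source_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> {code}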