[ https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-27267:
----------------------------------
Labels: pull-request-available (was: )
> Incorrect results when doing bucket map join on decimal bucketed column with subquery
> --------------------------------------------------------------------------------------
>
> Key: HIVE-27267
> URL: https://issues.apache.org/jira/browse/HIVE-27267
> Project: Hive
> Issue Type: Bug
> Reporter: Sourabh Badhya
> Assignee: Seonggon Namgung
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The following queries, when run on a Hive cluster, produce no results -
> Repro queries -
> {code:java}
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.support.concurrency=true;
> set hive.convert.join.bucket.mapjoin.tez=true;
>
> drop table if exists test_external_source;
> create external table test_external_source
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_source values
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2022-09-01', 'pipeline', '50000000000000000006008686831'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992621067'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992621067');
>
> drop table if exists test_external_target;
> create external table test_external_target
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_target values
>   ('2017-05-17', 'pipeline', '50000000000000000000441610525'),
>   ('2018-12-20', 'pipeline', '50000000000000000001048981030'),
>   ('2020-06-30', 'pipeline', '50000000000000000002332575516'),
>   ('2021-08-16', 'pipeline', '50000000000000000003897973989'),
>   ('2017-06-06', 'pipeline', '50000000000000000000449148729'),
>   ('2017-09-08', 'pipeline', '50000000000000000000525378314'),
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2018-05-03', 'pipeline', '50000000000000000000750826355'),
>   ('2020-01-10', 'pipeline', '50000000000000000001816579677'),
>   ('2021-11-01', 'pipeline', '50000000000000000004269423714'),
>   ('2017-11-07', 'pipeline', '50000000000000000000585901787'),
>   ('2019-10-15', 'pipeline', '50000000000000000001598843430'),
>   ('2020-04-01', 'pipeline', '50000000000000000002035795461'),
>   ('2020-02-24', 'pipeline', '50000000000000000001932600185'),
>   ('2020-04-27', 'pipeline', '50000000000000000002108160849'),
>   ('2016-07-05', 'pipeline', '50000000000000000000054405114'),
>   ('2020-06-02', 'pipeline', '50000000000000000002234387967'),
>   ('2020-08-21', 'pipeline', '50000000000000000002529168758'),
>   ('2021-02-17', 'pipeline', '50000000000000000003158511687');
>
> drop table if exists target_table;
> drop table if exists source_table;
> create table target_table
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets
>   stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true',
>                  'transactional_properties'='default');
> create table source_table
>   (date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets
>   stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true',
>                  'transactional_properties'='default');
>
> insert into table target_table select * from test_external_target;
> insert into table source_table select * from test_external_source;
> {code}
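> As a sanity check (not part of the original report, but useful when
> reproducing), one can confirm that both bucketed tables actually contain rows
> for the join key before running the failing query; given the data above, each
> of these should return 2:
> {code:java}
> -- Count the rows carrying the key under investigation in each bucketed table.
> select count(*) from source_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> select count(*) from target_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> {code}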
> The query under investigation -
> {code:java}
> select *
> from target_table
> inner join (select distinct date_col, 'pipeline' string_col, decimal_col
>             from source_table
>             where coalesce(decimal_col,'') = '50000000000000000005905545593') s
> on s.date_col = target_table.date_col
>    AND s.string_col = target_table.string_col
>    AND s.decimal_col = target_table.decimal_col;
> {code}
> Expected result of the query - 2 records
> {code:java}
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> | target_table.date_col  | target_table.string_col  | target_table.decimal_col       | s.date_col  | s.string_col  | s.decimal_col                  |
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> | 2022-08-16             | pipeline                 | 50000000000000000005905545593  | 2022-08-16  | pipeline      | 50000000000000000005905545593  |
> | 2022-08-30             | pipeline                 | 50000000000000000005905545593  | 2022-08-30  | pipeline      | 50000000000000000005905545593  |
> +------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
> {code}
> Actual result of the query - No records
> {code:java}
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> | target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> +------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
> {code}
> The workaround that produces the correct result here is to set the following
> config to false -
> {code:java}
> set hive.convert.join.bucket.mapjoin.tez=false;
> {code}
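> One way to verify which plan is in effect (a sketch, not from the original
> report) is to toggle the config and compare the explain output; with the
> conversion enabled, the Map Join operator in the plan is marked as a bucket
> map join (the exact wording of the explain output varies by Hive version):
> {code:java}
> set hive.convert.join.bucket.mapjoin.tez=true;
> explain
> select *
> from target_table
> inner join (select distinct date_col, 'pipeline' string_col, decimal_col
>             from source_table
>             where coalesce(decimal_col,'') = '50000000000000000005905545593') s
> on s.date_col = target_table.date_col
>    AND s.string_col = target_table.string_col
>    AND s.decimal_col = target_table.decimal_col;
> {code}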
> Notes from investigation -
> 1. The batch containing the 2 expected rows is forwarded correctly to the map
> join operator. However, during the join comparison, the hash table is empty.
> 2. The problem seems to be that even though HashTableDummyOperator loads the
> records into a hash table, the map join operator does not take into account
> the hash tables from all instances of HashTableDummyOperator (there are
> multiple instances because the bucket map join initiates multiple map tasks);
> instead it uses the hash table from only one of the HashTableDummyOperator
> instances. In this case, the selected instance had an empty hash table, hence
> no records were matched in the join operator. (A sketch for checking which
> bucket file holds the matching rows follows these notes.)
> 3. If the table is unbucketed or has a single bucket, the results are
> correct: only 1 map task is spawned, and it loads all the records into the
> hash table. The workaround (setting *hive.convert.join.bucket.mapjoin.tez* to
> {*}false{*}) has the same effect, since a single map task then loads all the
> records into the hash table.
> 4. HashTableDummyOperator is created in the optimizer and is associated with
> the plan, hence we suspect an issue in the optimizer code. Ideally, the map
> join operator must use the hash tables from all instances of
> HashTableDummyOperator.
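> To relate the notes above to the physical layout, the virtual column
> INPUT__FILE__NAME can show which bucket file (and hence which map task) the
> matching rows belong to (a sketch; the bucket_N file naming assumes the usual
> ACID table layout):
> {code:java}
> -- Lists the bucket files that hold the rows for the key under investigation.
> select distinct INPUT__FILE__NAME
> from source_table
> where coalesce(decimal_col,'') = '50000000000000000005905545593';
> {code}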