[jira] [Updated] (HIVE-27357) Map-side SMB Join returns incorrect result when it 2 tables have different bucket size

ASF GitHub Bot (Jira) Wed, 05 Jul 2023 01:13:11 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-27357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HIVE-27357:
----------------------------------
    Labels: pull-request-available  (was: )

> Map-side SMB Join returns incorrect result when it 2 tables have different 
> bucket size
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-27357
>                 URL: https://issues.apache.org/jira/browse/HIVE-27357
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.3, 4.0.0-alpha-2
>            Reporter: Seonggon Namgung
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>
> The following query returns \{(1, 1), (2, 2), (7, 7)} instead of \{(1, 1), 
> (2, 2), (7, 7), (6, 6), (14, 14), (11, 11)}.
>  
>  
> {code:java}
> set hive.strict.checks.bucketing=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.auto.convert.join.noconditionaltask.size=1;
> set hive.optimize.dynamic.partition.hashjoin=false;
> DROP TABLE IF EXISTS bucket2;
> CREATE TABLE bucket2(key string, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 2 BUCKETS;
> DROP TABLE IF EXISTS bucket3;
> CREATE TABLE bucket3(key string, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 3 BUCKETS;
> INSERT INTO TABLE bucket2 VALUES (1, 1), (2, 2), (7, 7), (6, 6), (14, 14), 
> (11, 11);
> INSERT INTO TABLE bucket3 VALUES (1, 1), (2, 2), (7, 7), (6, 6), (14, 14), 
> (11, 11);
> SELECT * FROM bucket2 JOIN bucket3 on bucket2.key = bucket3.key; {code}
>  
>  
> It is known that sort-merge join is used when two tables have the same number 
> of buckets, but I could not find such restriction from the source code. Also, 
> current Hive uses map side SMB join for the above query, which joins 2 
> buckets table and 3 buckets table. So I'm planning to fix this issue not by 
> using another Join algorithms.
> Originally, we found this issue by running auto_sortmerge_join_12.q with 
> hive.strict.checks.bucketing=true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-27357) Map-side SMB Join returns incorrect result when it 2 tables have different bucket size

Reply via email to