[jira] [Commented] (HIVE-26653) Wrong results when (map) joining multiple tables on partition column

Soumyakanti Das (Jira) Fri, 17 Oct 2025 16:21:58 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-26653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028506#comment-18028506
 ]


Soumyakanti Das commented on HIVE-26653:
----------------------------------------

[~zabetak] [~ghanko] 

This issue is no longer present in 
[https://github.com/apache/hive/commit/f75c18b57ae4218db689d1e76fabb1343fd78973]
 (current master).

With {{{}git bisect{}}}, the commit that fixed both issues is: 
[https://github.com/apache/hive/commit/36ce858163a19e29eafe4a8d3307191bc28fc175]

The plans look a bit different, but the results are expected.

Current plans:
{noformat}
CBO PLAN:
HiveProject($f0=[CAST(_UTF-16LE'20220731':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"):VARCHAR(2147483647) CHARACTER SET "UTF-16LE"])
  HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not 
available])
    HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not 
available])
      HiveProject($f0=[$0])
        HiveAggregate(group=[{0}])
          HiveProject($f0=[true])
            HiveFilter(condition=[=($1, _UTF-16LE'20220731')])
              HiveTableScan(table=[[default, table_b]], table:alias=[table_b])
      HiveProject($f0=[$0])
        HiveAggregate(group=[{0}])
          HiveProject($f0=[true])
            HiveFilter(condition=[=($1, _UTF-16LE'20220731')])
              HiveTableScan(table=[[default, table_a]], table:alias=[table_a])
    HiveProject($f0=[$0])
      HiveAggregate(group=[{0}])
        HiveProject($f0=[true])
          HiveFilter(condition=[=($1, _UTF-16LE'20220731')])
            HiveTableScan(table=[[default, table_a]], 
table:alias=[table_a])STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 4 <- Map 3 (SIMPLE_EDGE)
        Reducer 5 <- Map 3 (SIMPLE_EDGE), Reducer 2 (BROADCAST_EDGE), Reducer 4 
(BROADCAST_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: table_b
                  filterExpr: (p_dt = '20220731') (type: boolean)
                  Statistics: Num rows: 29 Data size: 319 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Select Operator
                    Statistics: Num rows: 29 Data size: 319 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: true (type: boolean)
                      minReductionHashAggr: 0.9655172
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: boolean)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col0 (type: boolean)
                        Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: table_a
                  filterExpr: (p_dt = '20220731') (type: boolean)
                  Statistics: Num rows: 31 Data size: 341 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Select Operator
                    Statistics: Num rows: 31 Data size: 341 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: true (type: boolean)
                      minReductionHashAggr: 0.96774197
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: boolean)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col0 (type: boolean)
                        Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: boolean)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col0 (type: boolean)
                        Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: boolean)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: COMPLETE
                Select Operator
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Reduce Output Operator
                    null sort order: 
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
        Reducer 4 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: boolean)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: COMPLETE
                Select Operator
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Reduce Output Operator
                    null sort order: 
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
        Reducer 5 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: boolean)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: COMPLETE
                Select Operator
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Map Join Operator
                    condition map:
                         Inner Join 0 to 1
                    keys:
                      0 
                      1 
                    input vertices:
                      0 Reducer 2
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 
                        1 
                      input vertices:
                        1 Reducer 4
                      Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: '20220731' (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 92 Basic stats: 
COMPLETE Column stats: COMPLETE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 92 Basic stats: 
COMPLETE Column stats: COMPLETE
                          table:
                              input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink{noformat}
 

I have also tested the query mentioned in the linked ticket and it is working 
fine too. I think we can mark both of them as resolved.

> Wrong results when (map) joining multiple tables on partition column
> --------------------------------------------------------------------
>
>                 Key: HIVE-26653
>                 URL: https://issues.apache.org/jira/browse/HIVE-26653
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: hive_26653.q, hive_26653_explain.txt, 
> hive_26653_explain_cbo.txt, table_a.csv, table_b.csv
>
>
> The result of the query must have exactly one row matching the date specified 
> in the WHERE clause but the query returns nothing.
> {code:sql}
> CREATE TABLE table_a (`aid` string ) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tbla.csv' into TABLE table_a;
> CREATE TABLE table_b (`bid` string) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tblb.csv' into TABLE table_b;
> set hive.auto.convert.join=true;
> set hive.optimize.semijoin.conversion=false;
> SELECT a.p_dt
> FROM ((SELECT p_dt
>        FROM table_b
>        GROUP BY p_dt) a
>          JOIN
>      (SELECT p_dt
>       FROM table_a
>       GROUP BY p_dt) b ON a.p_dt = b.p_dt
>          JOIN
>      (SELECT p_dt
>       FROM table_a
>       GROUP BY p_dt) c ON a.p_dt = c.p_dt)
> WHERE a.p_dt =  translate(cast(to_date(date_sub('2022-08-01', 1)) AS string), 
> '-', '');
> {code}
> +Expected result+
> 20220731
> +Actual result+
> Empty
> To reproduce the problem the tables need to have some data. Values in aid and 
> bid columns are not important. For p_dt column use one of the following 
> values 20220731, 20220630.
> I will attach some sample data with which the problem can be reproduced. The 
> tables look like below.
> ||aid|pdt||
> |611|20220731|
> |239|20220630|
> |...|...|
> The problem can be reproduced via qtest in current master 
> (commit 
> [6b05d64ce8c7161415d97a7896ea50025322e30a|https://github.com/apache/hive/commit/6b05d64ce8c7161415d97a7896ea50025322e30a])
>  by running the TestMiniLlapLocalCliDriver.
> There is specific query plan (will attach shortly) for which the problem 
> shows up so if the plan changes slightly the problem may not appear anymore; 
> this is why we need to set explicitly hive.optimize.semijoin.conversion and 
> hive.auto.convert.join to trigger the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-26653) Wrong results when (map) joining multiple tables on partition column

Reply via email to