[
https://issues.apache.org/jira/browse/HIVE-26653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039914#comment-18039914
]
Stamatis Zampetakis commented on HIVE-26653:
--------------------------------------------
I left some questions/comments on specific parts of the PR, although I am not
sure whether there is really a bug in the implementation of
{{VectorMapJoinInnerStringOperator}} or rather an issue in the compiler, which
generates an invalid plan.
Reading through the compiler logic and also through the Javadoc of
{{VectorMapJoinInnerBigOnly*}} variants it seems that when we don't need
anything from the small table to construct the join result we shouldn't use the
{{VectorMapJoinInnerStringOperator}} but the respective
{{VectorMapJoinInnerBigOnlyStringOperator}} implementation. In other words, I
don't think that {{VectorMapJoinInnerStringOperator}} was ever made to handle
the case where there are no values for the small table. [~sershe] or
[~mmccline] could easily confirm or reject this statement, but I am not sure if
they are still following the project. Still, giving a heads-up in case we are
lucky and they happen to see this message :)
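To make the rule above concrete, here is a minimal sketch of the choice as I
understand it. All names ({{InnerVariantChoiceSketch}}, {{InnerVariant}},
{{choose}}, {{smallTableValueColumns}}) are hypothetical and only for
illustration; this is not the actual compiler code, just a model of the
statement that the big-only variant is the right one when the join result needs
no values from the small table.

```java
// Hypothetical model of the selection rule described above; NOT Hive's compiler code.
final class InnerVariantChoiceSketch {

    /** Illustrative labels for the two inner map-join flavours under discussion. */
    enum InnerVariant { INNER, INNER_BIG_ONLY }

    /**
     * Picks the variant from the number of small-table value columns that the
     * join result needs (an assumed input; the real compiler derives this
     * information from the plan).
     */
    static InnerVariant choose(int smallTableValueColumns) {
        if (smallTableValueColumns < 0) {
            throw new IllegalArgumentException(
                "negative column count: " + smallTableValueColumns);
        }
        // Nothing needed from the small table: only big-table columns are output.
        return smallTableValueColumns == 0
            ? InnerVariant.INNER_BIG_ONLY
            : InnerVariant.INNER;
    }

    public static void main(String[] args) {
        System.out.println(choose(0)); // result uses big-table columns only
        System.out.println(choose(2)); // result also needs small-table values
    }
}
```

In the query of this ticket only {{a.p_dt}} is projected, so under this model
the count would be zero for the joins involved.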
In the past we had situations where an incorrect choice between the
aforementioned operators led to wrong results, so if there are improvements
that we can make to prevent this, that would be great. Most likely such
improvements should be on the compiler side rather than at runtime. However,
we could also explore whether there is a way to detect/guard against invalid
usages of a specific operator at runtime and raise an exception when this
happens.
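As a rough illustration of what such a runtime guard could look like, below is
a minimal sketch. The class {{InnerOperatorGuardSketch}}, the method
{{checkSmallTableValues}} and its parameters are hypothetical and do not exist
in Hive; where such a check would actually be hooked into the operator
lifecycle is an open question.

```java
// Hypothetical guard; none of these names exist in Hive.
final class InnerOperatorGuardSketch {

    /**
     * Fails fast when an operator that expects small-table values is planned
     * for a join that provides none (assumption: the count is known when the
     * operator is initialized).
     */
    static void checkSmallTableValues(String operatorName, int smallTableValueColumns) {
        if (smallTableValueColumns == 0) {
            throw new IllegalStateException(operatorName
                + " was planned without small-table values;"
                + " the big-only variant should have been chosen");
        }
    }

    public static void main(String[] args) {
        // Valid usage: at least one small-table value column, no exception.
        checkSmallTableValues("VectorMapJoinInnerStringOperator", 1);
        try {
            // Invalid usage: raises instead of silently producing wrong results.
            checkSmallTableValues("VectorMapJoinInnerStringOperator", 0);
        } catch (IllegalStateException e) {
            System.out.println("Guard fired: " + e.getMessage());
        }
    }
}
```

Raising an exception turns a silent wrong result into a visible query failure,
which is easier to notice and report.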
> Wrong results when (map) joining multiple tables on partition column
> --------------------------------------------------------------------
>
> Key: HIVE-26653
> URL: https://issues.apache.org/jira/browse/HIVE-26653
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 4.2.0
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Attachments: hive_26653.q, hive_26653_explain.txt,
> hive_26653_explain_cbo.txt, table_a.csv, table_b.csv
>
>
> The result of the query must have exactly one row matching the date specified
> in the WHERE clause but the query returns nothing.
> {code:sql}
> CREATE TABLE table_a (`aid` string ) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tbla.csv' into TABLE table_a;
> CREATE TABLE table_b (`bid` string) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tblb.csv' into TABLE table_b;
> set hive.auto.convert.join=true;
> set hive.optimize.semijoin.conversion=false;
> SELECT a.p_dt
> FROM ((SELECT p_dt
> FROM table_b
> GROUP BY p_dt) a
> JOIN
> (SELECT p_dt
> FROM table_a
> GROUP BY p_dt) b ON a.p_dt = b.p_dt
> JOIN
> (SELECT p_dt
> FROM table_a
> GROUP BY p_dt) c ON a.p_dt = c.p_dt)
> WHERE a.p_dt = translate(cast(to_date(date_sub('2022-08-01', 1)) AS string),
> '-', '');
> {code}
> +Expected result+
> 20220731
> +Actual result+
> Empty
> To reproduce the problem the tables need to have some data. Values in the aid
> and bid columns are not important. For the p_dt column use one of the
> following values: 20220731, 20220630.
> I will attach some sample data with which the problem can be reproduced. The
> tables look like the one below.
> ||aid|p_dt||
> |611|20220731|
> |239|20220630|
> |...|...|
> The problem can be reproduced via qtest in current master
> (commit
> [6b05d64ce8c7161415d97a7896ea50025322e30a|https://github.com/apache/hive/commit/6b05d64ce8c7161415d97a7896ea50025322e30a])
> by running the TestMiniLlapLocalCliDriver.
> There is a specific query plan (will attach shortly) for which the problem
> shows up, so if the plan changes slightly the problem may not appear anymore;
> this is why we need to explicitly set hive.optimize.semijoin.conversion and
> hive.auto.convert.join to trigger the problem.