[
https://issues.apache.org/jira/browse/IMPALA-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102554#comment-17102554
]
Xiaomin Zhang commented on IMPALA-9725:
---------------------------------------
Thanks [~tarmstrong] I use a sample dataset without NULL values to make the
reproduction easier, that Impala is wrong if query returns rows.
I used below DDL to create the tests after copying the attached sample data to
'/tmp/data'
{code:java}
cat | beeline -u jdbc:hive2://localhost:11050/ <<EOF
CREATE DATABASE IF NOT EXISTS wide_test;
USE wide_test;
DROP TABLE IF EXISTS wide;
CREATE TABLE IF NOT EXISTS wide(
id INT,
col2 STRING,
col3 STRING,
col4 STRING,
col5 STRING,
col6 STRING,
col7 STRING,
col8 STRING,
col9 STRING,
col10 STRING,
col11 STRING,
col12 STRING,
col13 STRING,
col14 STRING,
col15 STRING,
col16 STRING,
col17 STRING,
col18 STRING,
col19 STRING,
col20 STRING,
col21 STRING,
col22 STRING,
col23 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/test-warehouse/wide/';
LOAD DATA LOCAL INPATH '/tmp/data' OVERWRITE INTO TABLE wide;
EOF{code}
Then I can reproduce in impala-shell:
{code:java}
cat | impala-shell.sh <<EOF
invalidate metadata;
set mem_limit=70m;
select id from wide_test.wide L1 where not exists (
select 1 from wide_test.wide L2 where L1.id = L2.id
and L1.col2 = L2.col2
and L1.col3 = L2.col3
and L1.col4 = L2.col4
and L1.col5 = L2.col5
and L1.col6 = L2.col6
and L1.col7 = L2.col7
and L1.col8 = L2.col8
and L1.col9 = L2.col9
and L1.col10 = L2.col10
and L1.col11 = L2.col11
and L1.col12 = L2.col12
and L1.col13 = L2.col13
and L1.col14 = L2.col14
and L1.col15 = L2.col15
and L1.col16 = L2.col16
and L1.col17 = L2.col17
and L1.col18 = L2.col18
and L1.col19 = L2.col19
and L1.col20 = L2.col20
and L1.col21 = L2.col21
and L1.col22 = L2.col22
and L1.col23 = L2.col23
) order by id;
set mem_limit=80m;
select id from wide_test.wide L1 where not exists (
select 1 from wide_test.wide L2 where L1.id = L2.id
and L1.col2 = L2.col2
and L1.col3 = L2.col3
and L1.col4 = L2.col4
and L1.col5 = L2.col5
and L1.col6 = L2.col6
and L1.col7 = L2.col7
and L1.col8 = L2.col8
and L1.col9 = L2.col9
and L1.col10 = L2.col10
and L1.col11 = L2.col11
and L1.col12 = L2.col12
and L1.col13 = L2.col13
and L1.col14 = L2.col14
and L1.col15 = L2.col15
and L1.col16 = L2.col16
and L1.col17 = L2.col17
and L1.col18 = L2.col18
and L1.col19 = L2.col19
and L1.col20 = L2.col20
and L1.col21 = L2.col21
and L1.col22 = L2.col22
and L1.col23 = L2.col23
) order by id;
EOF
{code}
With mem_limit=70m, query returns 20+ rows which is incorrect. But with
mem_limit=80m, it returns zero row instead.
I checked the profiles, and the only difference is the bad query has spills.
> LEFT ANTI JOIN produces wrong result when PHJ build spills
> ----------------------------------------------------------
>
> Key: IMPALA-9725
> URL: https://issues.apache.org/jira/browse/IMPALA-9725
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.4.0
> Reporter: Xiaomin Zhang
> Assignee: Tim Armstrong
> Priority: Blocker
> Labels: correctness
>
> Using the attached data set, below query produced non-zero result when
> setting a small mem_limt. The expected result should be 0 because we are
> simply ANTI JOIN the same table.
> set mem_limit=100m;
> select id from wide_test.wide L1 where not exists (
> select 1 from wide_test.wide L2 where L1.id = L2.id
> and L1.col2 = L2.col2
> and L1.col3 = L2.col3
> and L1.col4 = L2.col4
> and L1.col5 = L2.col5
> and L1.col6 = L2.col6
> and L1.col7 = L2.col7
> and L1.col8 = L2.col8
> and L1.col9 = L2.col9
> and L1.col10 = L2.col10
> and L1.col11 = L2.col11
> and L1.col12 = L2.col12
> and L1.col13 = L2.col13
> and L1.col14 = L2.col14
> and L1.col15 = L2.col15
> and L1.col16 = L2.col16
> and L1.col17 = L2.col17
> and L1.col18 = L2.col18
> and L1.col19 = L2.col19
> and L1.col20 = L2.col20
> and L1.col21 = L2.col21
> and L1.col22 = L2.col22
> and L1.col23 = L2.col23
> ) order by id;
>
> With a larger mem_limit (or do not set mem_limit), above query return 0 which
> is correct.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]