Quanlong Huang created IMPALA-13193:
---------------------------------------
Summary: RuntimeFilter on parquet dictionary should evaluate null
values
Key: IMPALA-13193
URL: https://issues.apache.org/jira/browse/IMPALA-13193
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Quanlong Huang
IMPALA-10910 and IMPALA-5509 introduce an optimization that evaluates runtime
filters on parquet dictionary values. If none of the values pass the check, the
whole row group is skipped. However, NULL values are not included in the
parquet dictionary. A runtime filter that accepts NULL values can therefore
incorrectly reject a row group when none of its dictionary values pass the check.
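The faulty decision can be illustrated with a small Python model. This is only a sketch, not Impala's backend code: the function names, the `null_count` parameter, and the way the filter is modelled as a Python predicate are all assumptions made for illustration. It contrasts a skip check that consults only the dictionary entries with one that also evaluates the filter on NULL when the row group contains NULLs.

```python
from typing import Callable, Iterable, Optional

# Hypothetical model of a runtime filter: returns True if a probe value may match.
Filter = Callable[[Optional[str]], bool]


def make_coalesce_filter(build_keys: Iterable[str]) -> Filter:
    """Model a filter on COALESCE(name, '') built from the join's build side."""
    keys = frozenset(build_keys)
    return lambda value: ("" if value is None else value) in keys


def skip_row_group_dict_only(dict_values: Iterable[Optional[str]],
                             accepts: Filter) -> bool:
    """Skip check that looks only at dictionary entries (the reported behavior)."""
    return not any(accepts(v) for v in dict_values)


def skip_row_group_null_aware(dict_values: Iterable[Optional[str]],
                              null_count: int,
                              accepts: Filter) -> bool:
    """Skip check that also evaluates NULL, which the dictionary never holds."""
    if any(accepts(v) for v in dict_values):
        return False
    # Assumption: the number of NULLs in the row group is known to the caller.
    if null_count > 0 and accepts(None):
        return False
    return True


# Mirrors the repro in this report: the dictionary holds only "abc", two rows
# are NULL, and the build side contributes the single key '' (from NULL).
accepts = make_coalesce_filter([""])
print(skip_row_group_dict_only(["abc"], accepts))      # row group is skipped
print(skip_row_group_null_aware(["abc"], 2, accepts))  # row group is kept
```

In this model the dictionary-only check drops the row group even though its two NULL rows would pass the filter, while the NULL-aware check keeps it.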
Here are steps to reproduce the bug:
{code:sql}
create table parq_tbl (id bigint, name string) stored as parquet;
insert into parq_tbl values (0, "abc"), (1, NULL), (2, NULL), (3, "abc");
create table dim_tbl (name string);
insert into dim_tbl values (NULL);
select * from parq_tbl p join dim_tbl d
on COALESCE(p.name, '') = COALESCE(d.name, '');{code}
The SELECT query should return 2 rows (ids 1 and 2, whose NULL names become ''
under COALESCE and match the dim_tbl row), but it currently returns 0 rows.
A workaround is to disable this optimization:
{code:sql}
set PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT=0;{code}