[
https://issues.apache.org/jira/browse/IMPALA-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey updated IMPALA-13249:
-----------------------------
Description:
The problem occurs when _count(*)_ is queried within a UNION upon Iceberg V2
table values that had been inserted later on, when the table was already
populated with data.
{code:java}
create table prod_wrk.test_iceberg_v2
(
id_field integer,
data_field string
)
stored as iceberg
TBLPROPERTIES('format-version'='2');
insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
-- Prepare table
delete from prod_wrk.test_iceberg_v2
where id_field = 1;
-- Remove first record
insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
-- Insert new records with data_field values that were not present before
select *
from prod_wrk.test_iceberg_v2
order by data_field;
-- data_field = 'D' present
{code}
||id_field ||data_field||
|2|B|
|3|C|
|1|D|
|4|E|
At this point querying prod_wrk.test_iceberg_v2 is expected to yield one value
with _data_field = 'D'_ in any context
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
union all
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- the result coming from count(1) with the WHERE clause is incorrect (0){code}
||count(1)||
|4|
|0|
Expected result:
||count(1)||
|4|
|1|
However, altering the forementioned query for the COUNT optimization to come
off produces expected results:
{code:java}
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
union all
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- results become correct{code}
||count(1)||sum(1)||
|4|4|
|1|1|
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- result becomes correct as well{code}
||count(1)||
|1|
Count optimization was introduced by IMPALA-11802
was:
The problem occurs when count(*) is queried within a UNION upon Iceberg V2
table values that had been inserted later on, when the table was already
populated with data.
{code:java}
create table prod_wrk.test_iceberg_v2
(
id_field integer,
data_field string
)
stored as iceberg
TBLPROPERTIES('format-version'='2');
insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
-- Prepare table
delete from prod_wrk.test_iceberg_v2
where id_field = 1;
-- Remove first record
insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
-- Insert new records with data_field values that were not present before
select *
from prod_wrk.test_iceberg_v2
order by data_field;
-- data_field = 'D' present
{code}
||id_field ||data_field||
|2|B|
|3|C|
|1|D|
|4|E|
At this point querying prod_wrk.test_iceberg_v2 is expected to yield one value
with _data_field = 'D'_ in any context
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
union all
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- the result coming from count(1) with the WHERE clause is incorrect (0){code}
||count(1)||
|4|
|0|
Expected result:
||count(1)||
|4|
|1|
However, altering the forementioned query for the COUNT optimization to come
off produces expected results:
{code:java}
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
union all
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- results become correct{code}
||count(1)||sum(1)||
|4|4|
|1|1|
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- result becomes correct as well{code}
||count(1)||
|1|
Count optimization was introduced by
[IMPALA-11802|https://issues.apache.org/jira/browse/IMPALA-11802]
> COUNT(*) query optimisation for Iceberg V2 tables breaks UNION queries in
> some cases
> ------------------------------------------------------------------------------------
>
> Key: IMPALA-13249
> URL: https://issues.apache.org/jira/browse/IMPALA-13249
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.4.0
> Reporter: Aleksey
> Priority: Major
> Labels: impala-iceberg
>
>
> The problem occurs when _count(*)_ is queried within a UNION upon Iceberg V2
> table values that had been inserted later on, when the table was already
> populated with data.
> {code:java}
> create table prod_wrk.test_iceberg_v2
> (
> id_field integer,
> data_field string
> )
> stored as iceberg
> TBLPROPERTIES('format-version'='2');
> insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
> -- Prepare table
> delete from prod_wrk.test_iceberg_v2
> where id_field = 1;
> -- Remove first record
> insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
> -- Insert new records with data_field values that were not present before
> select *
> from prod_wrk.test_iceberg_v2
> order by data_field;
> -- data_field = 'D' present
> {code}
>
>
> ||id_field ||data_field||
> |2|B|
> |3|C|
> |1|D|
> |4|E|
>
>
> At this point querying prod_wrk.test_iceberg_v2 is expected to yield one
> value with _data_field = 'D'_ in any context
>
> {code:java}
> select count(1)
> from prod_wrk.test_iceberg_v2
> union all
> select count(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- the result coming from count(1) with the WHERE clause is incorrect
> (0){code}
>
>
> ||count(1)||
> |4|
> |0|
>
> Expected result:
> ||count(1)||
> |4|
> |1|
>
> However, altering the forementioned query for the COUNT optimization to come
> off produces expected results:
>
> {code:java}
> select count(1), sum(1)
> from prod_wrk.test_iceberg_v2
> union all
> select count(1), sum(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- results become correct{code}
>
>
> ||count(1)||sum(1)||
> |4|4|
> |1|1|
>
>
> {code:java}
> select count(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- result becomes correct as well{code}
>
>
> ||count(1)||
> |1|
>
> Count optimization was introduced by IMPALA-11802
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]