[jira] [Updated] (IMPALA-13249) COUNT(*) query optimisation for Iceberg V2 tables breaks UNION queries in some cases

Aleksey (Jira) Mon, 22 Jul 2024 06:49:03 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aleksey updated IMPALA-13249:
-----------------------------
    Description: 
 

The problem occurs when _count(*)_ is queried within a UNION upon Iceberg V2 
table values that had been inserted later on, when the table was already 
populated with data.
{code:java}
create table prod_wrk.test_iceberg_v2
(
 id_field integer,
 data_field string
)
stored as iceberg
TBLPROPERTIES('format-version'='2');
insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
-- Prepare table
delete from prod_wrk.test_iceberg_v2
where id_field = 1;
-- Remove first record
insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
-- Insert new records with data_field values that were not present before
select *
from prod_wrk.test_iceberg_v2
order by data_field;
-- data_field = 'D' present
{code}
 

 
||id_field ||data_field||
|2|B|
|3|C|
|1|D|
|4|E|

 

 

At this point querying prod_wrk.test_iceberg_v2 is expected to yield one value 
with _data_field = 'D'_ in any context

 
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
union all
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- the result coming from count(1) with the WHERE clause is incorrect (0){code}
 

 
||count(1)||
|4|
|0|

 

Expected result:
||count(1)||
|4|
|1|

 

However, altering the forementioned query for the COUNT optimization to come 
off produces expected results:

 
{code:java}
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
union all
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- results become correct{code}
 

 
||count(1)||sum(1)||
|4|4|
|1|1|

 

 
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- result becomes correct as well{code}
 

 
||count(1)||
|1|

 

Count optimization was introduced by IMPALA-11802

  was:
 

The problem occurs when count(*) is queried within a UNION upon Iceberg V2 
table values that had been inserted later on, when the table was already 
populated with data.
{code:java}
create table prod_wrk.test_iceberg_v2
(
 id_field integer,
 data_field string
)
stored as iceberg
TBLPROPERTIES('format-version'='2');
insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
-- Prepare table
delete from prod_wrk.test_iceberg_v2
where id_field = 1;
-- Remove first record
insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
-- Insert new records with data_field values that were not present before
select *
from prod_wrk.test_iceberg_v2
order by data_field;
-- data_field = 'D' present
{code}
 

 
||id_field ||data_field||
|2|B|
|3|C|
|1|D|
|4|E|

 

 

At this point querying prod_wrk.test_iceberg_v2 is expected to yield one value 
with _data_field = 'D'_ in any context

 
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
union all
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- the result coming from count(1) with the WHERE clause is incorrect (0){code}
 

 
||count(1)||
|4|
|0|

 

Expected result:
||count(1)||
|4|
|1|

 

However, altering the forementioned query for the COUNT optimization to come 
off produces expected results:

 
{code:java}
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
union all
select count(1), sum(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- results become correct{code}
 

 
||count(1)||sum(1)||
|4|4|
|1|1|

 

 
{code:java}
select count(1)
from prod_wrk.test_iceberg_v2
where data_field = 'D';
-- result becomes correct as well{code}
 

 
||count(1)||
|1|

 


Count optimization was introduced by 
[IMPALA-11802|https://issues.apache.org/jira/browse/IMPALA-11802]


> COUNT(*) query optimisation for Iceberg V2 tables breaks UNION queries in 
> some cases
> ------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13249
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13249
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.4.0
>            Reporter: Aleksey
>            Priority: Major
>              Labels: impala-iceberg
>
>  
> The problem occurs when _count(*)_ is queried within a UNION upon Iceberg V2 
> table values that had been inserted later on, when the table was already 
> populated with data.
> {code:java}
> create table prod_wrk.test_iceberg_v2
> (
>  id_field integer,
>  data_field string
> )
> stored as iceberg
> TBLPROPERTIES('format-version'='2');
> insert into prod_wrk.test_iceberg_v2 values (1, "A"), (2, "B"), (3, "C");
> -- Prepare table
> delete from prod_wrk.test_iceberg_v2
> where id_field = 1;
> -- Remove first record
> insert into prod_wrk.test_iceberg_v2 values (1, "D"), (4, "E");
> -- Insert new records with data_field values that were not present before
> select *
> from prod_wrk.test_iceberg_v2
> order by data_field;
> -- data_field = 'D' present
> {code}
>  
>  
> ||id_field ||data_field||
> |2|B|
> |3|C|
> |1|D|
> |4|E|
>  
>  
> At this point querying prod_wrk.test_iceberg_v2 is expected to yield one 
> value with _data_field = 'D'_ in any context
>  
> {code:java}
> select count(1)
> from prod_wrk.test_iceberg_v2
> union all
> select count(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- the result coming from count(1) with the WHERE clause is incorrect 
> (0){code}
>  
>  
> ||count(1)||
> |4|
> |0|
>  
> Expected result:
> ||count(1)||
> |4|
> |1|
>  
> However, altering the forementioned query for the COUNT optimization to come 
> off produces expected results:
>  
> {code:java}
> select count(1), sum(1)
> from prod_wrk.test_iceberg_v2
> union all
> select count(1), sum(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- results become correct{code}
>  
>  
> ||count(1)||sum(1)||
> |4|4|
> |1|1|
>  
>  
> {code:java}
> select count(1)
> from prod_wrk.test_iceberg_v2
> where data_field = 'D';
> -- result becomes correct as well{code}
>  
>  
> ||count(1)||
> |1|
>  
> Count optimization was introduced by IMPALA-11802



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-13249) COUNT(*) query optimisation for Iceberg V2 tables breaks UNION queries in some cases

Reply via email to