[jira] [Created] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

Gabor Kaszab (Jira) Tue, 12 Mar 2024 07:09:06 -0700

Gabor Kaszab created IMPALA-12894:
-------------------------------------

             Summary: Optimized count(*) for Iceberg gives wrong results after 
a Spark rewrite_data_files
                 Key: IMPALA-12894
                 URL: https://issues.apache.org/jira/browse/IMPALA-12894
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.3.0
            Reporter: Gabor Kaszab



This issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802,
which implemented an optimized way to get results for count(*). However, if the
table was compacted by Spark, this optimization can give incorrect results.

The reason is that Spark can [skip dropping delete files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files]
that point to compacted data files. As a result, there might be delete files
after compaction that no longer apply to any data file.

Repro:

With Impala
{code:java}
create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
              'iceberg.table_identifier'='iceberg_testing',
              'format-version'='2');
insert into iceberg_testing values
    (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
update iceberg_testing set j = -100 where id = 4;
delete from iceberg_testing where id = 4;{code}
count(*) returns 4 at this point.

Run compaction in Spark:
{code:java}
spark.sql("CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2'))").show()
{code}
Now count(*) in Impala returns 8 (an INVALIDATE METADATA might be required
first if the table is in a HadoopCatalog). Hive returns correct results. A
SELECT * also returns correct results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
