[
https://issues.apache.org/jira/browse/IMPALA-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826868#comment-17826868
]
ASF subversion and git services commented on IMPALA-12894:
----------------------------------------------------------
Commit ada4090e0989805ed884e135356c6b688e7ccc96 in impala's branch
refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ada4090e0 ]
IMPALA-12894: (part 1) Turn off the count(*) optimisation for V2 Iceberg tables
This is a part 1 change that turns off the count(*) optimisation for
V2 tables because it has a correctness issue: Spark compaction may
leave dangling delete files behind, and these break the counting
logic in Impala.
Change-Id: Ida9fb04fd076c987b6b5257ad801bf30f5900237
Reviewed-on: http://gerrit.cloudera.org:8080/21139
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Optimized count(*) for Iceberg gives wrong results after a Spark
> rewrite_data_files
> -----------------------------------------------------------------------------------
>
> Key: IMPALA-12894
> URL: https://issues.apache.org/jira/browse/IMPALA-12894
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.3.0
> Reporter: Gabor Kaszab
> Priority: Critical
> Labels: correctness, impala-iceberg
> Attachments: count_star_correctness_repro.tar.gz
>
>
> The issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802,
> which implemented an optimized way to answer count(*). However, if the table
> was compacted by Spark, this optimization can give incorrect results.
> The reason is that Spark can [skip dropping delete
> files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files]
> that point to compacted data files; as a result, there might be delete
> files after compaction that no longer apply to any data file.
> Repro:
> With Impala
> {code:java}
> create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
> 'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
> 'iceberg.table_identifier'='iceberg_testing',
> 'format-version'='2');
> insert into iceberg_testing values
> (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
> update iceberg_testing set j = -100 where id = 4;
> delete from iceberg_testing where id = 4;{code}
> count(*) returns 4 at this point.
> Run compaction in Spark:
> {code:java}
> spark.sql(s"CALL local.system.rewrite_data_files(table =>
> 'default.iceberg_testing', options => map('min-input-files','2') )").show()
> {code}
> Now count(*) in Impala returns 8 (might require an INVALIDATE METADATA
> if using HadoopCatalog). Hive returns correct results, and a SELECT *
> also returns correct results.
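> As an aside, per the Iceberg Spark-procedures docs linked above, the
> dangling position delete files left behind by rewrite_data_files can be
> cleaned up with a separate rewrite_position_delete_files call. A minimal
> sketch, assuming the same 'local' catalog and table as in the repro above
> (not verified against this repro):
> {code:java}
> spark.sql(s"CALL local.system.rewrite_position_delete_files(table =>
> 'default.iceberg_testing')").show()
> {code}
> After that, delete files that no longer apply to any data file should be
> gone, so they can no longer skew the optimized count(*) path.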