[ 
https://issues.apache.org/jira/browse/IMPALA-13501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911668#comment-17911668
 ] 

ASF subversion and git services commented on IMPALA-13501:
----------------------------------------------------------

Commit c518d3c8182749b83f938105c605c1b67755513c in impala's branch 
refs/heads/master from Noemi Pap-Takacs
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c518d3c81 ]

IMPALA-13501: Clean up uncommitted Iceberg files after validation check failure

Iceberg supports multiple writers with optimistic concurrency.
Each writer can write new files which are then added to the table
after a validation check to ensure that the commit does not conflict
with other modifications made during the execution.

When a conflicting change could not be resolved, the newly written
files could not be committed to the table, so they became orphan
files on the file system. Orphan files can accumulate over time and
take up a lot of storage space. They do not belong to the table
because no snapshot references them, so expiring snapshots cannot
remove them.

This change introduces automatic cleanup of uncommitted files
after an unsuccessful DML operation to prevent creating orphan files.
No cleanup is done if Iceberg throws CommitStateUnknownException
because the update success or failure is unknown in this case.
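
The cleanup rule above can be sketched as follows. This is a minimal
Python illustration, not the code from the commit: ValidationError and
CommitStateUnknownError are hypothetical stand-ins for Iceberg's
ValidationException and CommitStateUnknownException, and
commit_with_cleanup, commit, written_files and delete_file are names
assumed for the example.

```python
import os


class ValidationError(Exception):
    """Hypothetical stand-in for Iceberg's ValidationException."""


class CommitStateUnknownError(Exception):
    """Hypothetical stand-in for Iceberg's CommitStateUnknownException."""


def commit_with_cleanup(commit, written_files, delete_file=os.remove):
    """Run commit(); delete written_files only if the commit definitely failed.

    Returns True when the commit succeeds. Any error is re-raised to the caller.
    """
    try:
        commit()
    except CommitStateUnknownError:
        # The commit may have succeeded, so the files may already be
        # referenced by the table. Leave them in place.
        raise
    except ValidationError:
        # The commit definitely failed: the files are uncommitted.
        for path in written_files:
            try:
                delete_file(path)
            except OSError:
                # Best effort: a failed delete must not hide the original error.
                pass
        raise
    return True
```

The unknown-state branch is listed first to make the asymmetry
explicit: files are deleted only when the failure is certain.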

Testing:
- E2E test: Injected ValidationException with debug option.
- Stress test: Added a method to check that no orphan files were
  created after failed conflicting commits.
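
A check of the kind described in the stress test item could look
roughly like this. The sketch assumes that the caller already has the
list of files referenced by the table's snapshots; how the actual test
obtains that list is not stated here. find_orphan_files, data_dir and
referenced_files are hypothetical names.

```python
import os


def find_orphan_files(data_dir, referenced_files):
    """Return the sorted paths under data_dir that are not in referenced_files."""
    referenced = {os.path.abspath(p) for p in referenced_files}
    orphans = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            if path not in referenced:
                orphans.append(path)
    return sorted(orphans)
```

After a run of failed conflicting commits, a test would assert that
the returned list is empty.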

Change-Id: Ibe59546ebf3c639b75b53dfa1daba37cef50eb21
Reviewed-on: http://gerrit.cloudera.org:8080/22189
Reviewed-by: Daniel Becker <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Conflicting commits to Iceberg tables leave uncommitted orphan files
> --------------------------------------------------------------------
>
>                 Key: IMPALA-13501
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13501
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Noemi Pap-Takacs
>            Assignee: Noemi Pap-Takacs
>            Priority: Major
>              Labels: impala-iceberg
>
> Iceberg supports multiple writers with optimistic concurrency. Each writer 
> can write new files which are then added to the table after a validation 
> check to ensure that the commit does not conflict with other modifications 
> made during the execution.
> When there is a conflicting change and the newly written files cannot be 
> committed, there are 2 possible outcomes: the commit can be retried and 
> rebased on top of the latest snapshot or, if this cannot resolve the 
> conflict, the change cannot be committed and the files become orphan files 
> in the file system.
> It would be nice to remove the remaining files from an unsuccessful commit in 
> one step. Deleting orphan files later as a table maintenance step is also a 
> possible resolution.



