[ https://issues.apache.org/jira/browse/IMPALA-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Noemi Pap-Takacs reassigned IMPALA-13382:
-----------------------------------------
Assignee: Noemi Pap-Takacs
> OPTIMIZE could be more resistant to concurrent write operations
> ---------------------------------------------------------------
>
> Key: IMPALA-13382
> URL: https://issues.apache.org/jira/browse/IMPALA-13382
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Zoltán Borók-Nagy
> Assignee: Noemi Pap-Takacs
> Priority: Major
> Labels: impala-iceberg
>
> When a data file that is being replaced by the OPTIMIZE statement is
> modified concurrently, we can get the following error:
> {noformat}
> Cannot commit, found new delete for replaced data file: ...
> {noformat}
> Because of this we cannot commit OPTIMIZE, meaning all of its work is lost.
> Moreover, the newly written data files remain on storage as orphan files.
> To avoid such conflicts, we could do the following before committing
> OPTIMIZE:
> # Check if there is partition evolution involved in the file replacements.
> If so, we cannot filter reliably, so we proceed optimistically (hoping that
> the data files associated with deletes are not selected by OPTIMIZE) and
> jump straight to "commit OPTIMIZE". Otherwise:
> # Check if there are new snapshots since the base snapshot of the OPTIMIZE
> statement
> # If there are, then iterate over the snapshots
> # Collect the delete files (possibly via Snapshot.addedDeleteFiles())
> # Collect the set of partitions associated with delete files
> # Filter the file replacements by excluding the affected partitions (at this
> point all files involved use the current partition spec)
> ** We can also remove the newly written data files belonging to the affected
> partitions
> # Commit OPTIMIZE
> We need to do this at partition-level granularity as we don't know exactly
> which new data files replace which old files. If partition evolution is
> involved, then we cannot tell which new data files hold the data records
> coming from the old partitioning.
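
The pre-commit filtering steps listed above can be sketched roughly as follows. This is only an illustrative model in plain Python, not Impala or Iceberg code: `FileInfo`, `Snapshot`, `Rewrite`, `snapshots_since` and `filter_rewrite` are hypothetical names, and `added_delete_files` merely stands in for whatever `Snapshot.addedDeleteFiles()` would return. It also assumes that every delete file carries a partition value comparable with those of the data files, which ignores unpartitioned or global deletes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for an Iceberg data file or delete file."""
    path: str
    spec_id: int        # id of the partition spec the file was written with
    partition: Tuple    # partition values, e.g. ("2024-01-01",)


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot."""
    snapshot_id: int
    parent_id: Optional[int]
    added_delete_files: Tuple[FileInfo, ...] = ()


@dataclass
class Rewrite:
    """The file replacement that OPTIMIZE wants to commit."""
    removed: List[FileInfo]   # old data files being replaced
    added: List[FileInfo]     # newly written data files


def snapshots_since(snapshots: Dict[int, Snapshot],
                    base_id: Optional[int],
                    current_id: Optional[int]) -> List[Snapshot]:
    """Follow parent pointers from the current snapshot back to the base."""
    newer = []
    cursor = current_id
    while cursor != base_id:
        if cursor is None:
            raise ValueError("base snapshot is not an ancestor of the current one")
        snap = snapshots[cursor]
        newer.append(snap)
        cursor = snap.parent_id
    return newer


def filter_rewrite(rewrite: Rewrite,
                   snapshots: Dict[int, Snapshot],
                   base_id: Optional[int],
                   current_id: Optional[int],
                   current_spec_id: int) -> Tuple[Rewrite, List[FileInfo]]:
    """Return the filtered rewrite plus the new files that became orphans."""
    # Step 1: with partition evolution we cannot map old files to new ones,
    # so commit everything unchanged and accept the risk of a conflict.
    if any(f.spec_id != current_spec_id for f in rewrite.removed + rewrite.added):
        return rewrite, []

    # Steps 2-3: find the snapshots committed after OPTIMIZE started.
    newer = snapshots_since(snapshots, base_id, current_id)
    if not newer:
        return rewrite, []

    # Steps 4-5: collect the partitions touched by new delete files.
    affected = {d.partition for s in newer for d in s.added_delete_files}

    # Step 6: drop every replacement that belongs to an affected partition.
    kept = Rewrite(
        removed=[f for f in rewrite.removed if f.partition not in affected],
        added=[f for f in rewrite.added if f.partition not in affected],
    )
    orphans = [f for f in rewrite.added if f.partition in affected]
    return kept, orphans
```

The caller would commit `kept` (step 7) and could delete the files in `orphans` from storage. Whole partitions are dropped rather than individual files because, as noted above, the mapping from old to new data files is unknown.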