[ 
https://issues.apache.org/jira/browse/IMPALA-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noemi Pap-Takacs reassigned IMPALA-13382:
-----------------------------------------

    Assignee: Noemi Pap-Takacs

> OPTIMIZE could be more resistant to concurrent write operations
> ---------------------------------------------------------------
>
>                 Key: IMPALA-13382
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13382
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Noemi Pap-Takacs
>            Priority: Major
>              Labels: impala-iceberg
>
> When a data file that the OPTIMIZE statement is replacing is modified 
> concurrently, we can get the following error:
> {noformat}
> Cannot commit, found new delete for replaced data file: ...
> {noformat}
>  Because of this we cannot commit OPTIMIZE, meaning all of its work is lost. 
> Moreover, the newly written data files remain on storage as orphan files.
> To avoid such conflicts, we could do the following before committing 
> OPTIMIZE:
>  # Check if there is partition evolution involved in the file replacements. 
> If so, let's just hope that the data files associated with deletes are not 
> selected by OPTIMIZE, and jump straight to "commit OPTIMIZE". Otherwise:
>  # Check if there are new snapshots since the base snapshot of the OPTIMIZE 
> statement
>  # If there are, then iterate over the snapshots
>  # Collect the delete files (possibly via Snapshot.addedDeleteFiles())
>  # Collect the set of partitions associated with delete files
>  # Filter the file replacements by excluding the affected partitions (at 
> this point all replaced files use the current partition spec)
>  ** We can also remove the newly written data files belonging to the affected 
> partitions
>  # Commit OPTIMIZE
> We need to do this at partition-level granularity as we don't exactly know 
> which new data files replace which old files. If partition evolution is 
> involved, then we have no way of knowing which new data files hold the data 
> records coming from the old partitioning.
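
The steps proposed in the description can be sketched roughly as below. This is only an illustration under stated assumptions, not Impala or Iceberg code: FileInfo, SnapshotInfo, Rewrite, Plan and filter() are hypothetical stand-ins, partitions are modelled as plain strings, and the delete files of each newer snapshot are assumed to be already collected (the description suggests Snapshot.addedDeleteFiles() as one possible source).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the proposed pre-commit filtering; every type here is a hypothetical stand-in. */
final class OptimizeConflictFilter {
  /** A data or delete file reduced to its path, partition key, and partition spec id. */
  record FileInfo(String path, String partition, int specId) {}

  /** A snapshot committed after the OPTIMIZE base snapshot, with the delete files it added. */
  record SnapshotInfo(long id, List<FileInfo> addedDeleteFiles) {}

  /** The file replacement that OPTIMIZE intends to commit. */
  record Rewrite(List<FileInfo> replaced, List<FileInfo> added) {}

  /** The replacement that is left to commit, plus new files that would otherwise be orphaned. */
  record Plan(Rewrite toCommit, List<FileInfo> orphansToRemove) {}

  static Plan filter(Rewrite rewrite, List<SnapshotInfo> newSnapshots, int currentSpecId) {
    // Step 1: with partition evolution we cannot match old and new files per partition,
    // so the replacement is committed unchanged.
    boolean evolved = rewrite.replaced().stream().anyMatch(f -> f.specId() != currentSpecId);
    // Step 2: nothing to filter if no snapshot was committed since the base snapshot.
    if (evolved || newSnapshots.isEmpty()) return new Plan(rewrite, List.of());

    // Steps 3-5: collect the partitions touched by delete files added in newer snapshots.
    Set<String> affected = new HashSet<>();
    for (SnapshotInfo s : newSnapshots) {
      for (FileInfo d : s.addedDeleteFiles()) affected.add(d.partition());
    }

    // Step 6: drop the affected partitions from both sides of the replacement;
    // new data files in those partitions are returned so they can be deleted.
    List<FileInfo> replaced = new ArrayList<>();
    List<FileInfo> added = new ArrayList<>();
    List<FileInfo> orphans = new ArrayList<>();
    for (FileInfo f : rewrite.replaced()) {
      if (!affected.contains(f.partition())) replaced.add(f);
    }
    for (FileInfo f : rewrite.added()) {
      (affected.contains(f.partition()) ? orphans : added).add(f);
    }
    // Step 7 (the actual commit) is left to the caller.
    return new Plan(new Rewrite(replaced, added), orphans);
  }
}
```

The sketch filters on whole partitions for the reason given in the description: the mapping from new data files to the old files they replace is not known. A caller would commit Plan.toCommit() and remove the files in Plan.orphansToRemove() from storage.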



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
