Zoltán Borók-Nagy created IMPALA-13382:
------------------------------------------
Summary: OPTIMIZE could be more resistant to concurrent write operations
Key: IMPALA-13382
URL: https://issues.apache.org/jira/browse/IMPALA-13382
Project: IMPALA
Issue Type: Bug
Components: Catalog
Reporter: Zoltán Borók-Nagy

When there is a concurrent modification of a data file that is being replaced by the OPTIMIZE statement, we can get the following error:

{noformat}
Cannot commit, found new delete for replaced data file: ...
{noformat}

Because of this we cannot commit OPTIMIZE, meaning all of its work is lost. Moreover, the newly written data files remain on storage as orphan files.

To avoid such conflicts, we could do the following before committing OPTIMIZE:

# Check if there are new snapshots since the base snapshot of the OPTIMIZE statement
# If there are, then iterate over the snapshots
# Collect the delete files (possibly via Snapshot.addedDeleteFiles())
# Collect the set of partitions associated with delete files
#* If there are partitions with old spec, then don't filter the replacements (as we don't know what is replaced by what, so let's just hope that the data files associated with deletes are not selected by OPTIMIZE)
#* Otherwise filter the file replacements by excluding the affected partitions (all have current partition spec)
#** We can also remove the newly written data files belonging to the affected partitions
# Commit OPTIMIZE

We need to do this at partition-level granularity as we don't exactly know which new data files replace which old files. If partition evolution is involved, then we have absolutely no idea which new data files hold the data records coming from old partitioning.
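
The filtering steps proposed above could be sketched roughly as follows. This is only an illustration in Python, not Impala or Iceberg code: ContentFile, Snapshot and filter_replacements are hypothetical names, and the added_delete_files field merely stands in for whatever Snapshot.addedDeleteFiles() would return. The sketch also assumes that both the replaced and the newly written data files use the current partition spec; it does not try to handle replaced files written under an older spec.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ContentFile:
    """Hypothetical stand-in for a data file or a delete file."""
    path: str
    spec_id: int      # id of the partition spec the file was written with
    partition: Tuple  # partition values under that spec


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot committed after the base snapshot."""
    snapshot_id: int
    added_delete_files: Tuple[ContentFile, ...] = ()


def filter_replacements(
    new_snapshots: Iterable[Snapshot],
    current_spec_id: int,
    replaced: Iterable[ContentFile],
    added: Iterable[ContentFile],
) -> Tuple[List[ContentFile], List[ContentFile], List[ContentFile]]:
    """Return (replaced_to_commit, added_to_commit, orphans_to_remove)."""
    replaced, added = list(replaced), list(added)

    # Steps 1-3: walk the snapshots newer than the base, collect delete files.
    deletes = [d for s in new_snapshots for d in s.added_delete_files]

    # No concurrent deletes: nothing to filter.
    if not deletes:
        return replaced, added, []

    # A delete under an old spec cannot be mapped to the new files,
    # so leave the replacements unfiltered, as the proposal suggests.
    if any(d.spec_id != current_spec_id for d in deletes):
        return replaced, added, []

    # Step 4: all deletes use the current spec, so exclude their partitions.
    affected = {d.partition for d in deletes}
    kept_replaced = [f for f in replaced if f.partition not in affected]
    kept_added = [f for f in added if f.partition not in affected]
    # New files in the affected partitions will not be committed and
    # could be removed from storage instead of being left as orphans.
    orphans = [f for f in added if f.partition in affected]
    return kept_replaced, kept_added, orphans
```

The third return value is there so that the caller can clean up the data files that were written for the excluded partitions. Step 5 (the actual commit) is deliberately left out, since it depends on the commit API in use.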