Zoltán Borók-Nagy created IMPALA-13382:
------------------------------------------

             Summary: OPTIMIZE could be more resistant to concurrent write 
operations
                 Key: IMPALA-13382
                 URL: https://issues.apache.org/jira/browse/IMPALA-13382
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
            Reporter: Zoltán Borók-Nagy


When a data file that is being replaced by the OPTIMIZE statement is modified 
concurrently, we can get the following error:
{noformat}
Cannot commit, found new delete for replaced data file: ...
{noformat}
 Because of this we cannot commit OPTIMIZE, meaning all of its work is lost. 
Moreover, the newly written data files remain on storage as orphan files.

To avoid such conflicts, we could do the following before committing OPTIMIZE:
 # Check if there are new snapshots since the base snapshot of the OPTIMIZE 
statement
 # If there are, then iterate over the snapshots
 # Collect the delete files (possibly via Snapshot.addedDeleteFiles())
 # Collect the set of partitions associated with delete files
 #* If there are partitions with old spec, then don't filter the replacements 
(as we don't know what is replaced by what, so let's just hope that the data 
files associated with deletes are not selected by OPTIMIZE)
 #* Otherwise filter the file replacements by excluding the affected partitions 
(all have current partition spec)
 #** We can also remove the newly written data files belonging to the affected 
partitions
 # Commit OPTIMIZE

We need to do this at partition-level granularity as we don't exactly know 
which new data files replace which old files. If partition evolution is 
involved, then we have absolutely no idea which new data files hold the data 
records coming from old partitioning.
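
The steps above can be sketched roughly as follows. This is only a minimal, self-contained model of the idea, not Impala or Iceberg code: the classes Snapshot, DeleteFile and DataFile and the functions snapshots_since and filter_replacements are hypothetical names made up for illustration, and the added_delete_files field merely stands in for what Snapshot.addedDeleteFiles() would return. The sketch also assumes that the data files selected by OPTIMIZE use the current partition spec; mixed-spec inputs need extra care that is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeleteFile:
    path: str
    spec_id: int       # partition spec the delete file was written with
    partition: tuple   # partition values under that spec


@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int
    partition: tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    added_delete_files: tuple = ()  # stand-in for Snapshot.addedDeleteFiles()


def snapshots_since(snapshots_by_id, current_id, base_id):
    """Walk parent pointers from the current snapshot back to the base
    snapshot (exclusive). Returns the new snapshots, newest first."""
    result = []
    sid = current_id
    while sid is not None and sid != base_id:
        snap = snapshots_by_id[sid]
        result.append(snap)
        sid = snap.parent_id
    return result


def filter_replacements(new_snapshots, current_spec_id, replaced, written):
    """Return (replaced_to_commit, written_to_commit, written_to_remove)."""
    deletes = [d for s in new_snapshots for d in s.added_delete_files]

    # Old partition spec involved: we cannot tell what is replaced by what,
    # so leave the replacements unfiltered and let the commit validate.
    if any(d.spec_id != current_spec_id for d in deletes):
        return list(replaced), list(written), []

    # All affected partitions use the current spec: exclude them entirely.
    affected = {(d.spec_id, d.partition) for d in deletes}

    def is_affected(f):
        return (f.spec_id, f.partition) in affected

    keep_replaced = [f for f in replaced if not is_affected(f)]
    keep_written = [f for f in written if not is_affected(f)]
    # Newly written files of the affected partitions would otherwise
    # become orphans, so the caller can remove them from storage.
    to_remove = [f for f in written if is_affected(f)]
    return keep_replaced, keep_written, to_remove
```

If no snapshots were added since the base snapshot, snapshots_since returns an empty list and filter_replacements leaves the replacement untouched, so the common non-conflicting case commits exactly as before.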



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

