Prashant Wason created HUDI-6213:
------------------------------------

             Summary: Parallelize deletion of files during rollback.
                 Key: HUDI-6213
                 URL: https://issues.apache.org/jira/browse/HUDI-6213
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason


Assume we are rolling back a commit with a large number of files (1k+) in a 
partition.

*Current strategy:*
For each partition, create a rollback request which contains the list of all 
the files to be deleted from that partition. Since each rollback request is 
executed on a single executor, in this model one executor deletes the 1k+ 
files sequentially. This is slow and does not take advantage of the rollback 
parallelism or of the presence of multiple executors.

*Changed strategy:*
Each rollback request should only contain a single file to be deleted from a 
partition. Since each rollback request is executed on an executor, in this 
model 1k+ tasks will be executed in parallel on the available executors. This 
will speed up the deletion part of the rollback.
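
The difference between the two strategies can be sketched as follows. This is 
only an illustration, not Hudi's actual code: the RollbackRequest class, the 
perPartition and perFile helpers, and the sample partition path are all 
hypothetical names chosen for this example. It shows only how the requests are 
grouped, on the assumption that each request becomes one unit of work for an 
executor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RollbackRequestSketch {

    // Hypothetical stand-in for a rollback request: a partition plus the
    // files to delete from it. Not the real Hudi type.
    static final class RollbackRequest {
        final String partitionPath;
        final List<String> filesToDelete;

        RollbackRequest(String partitionPath, List<String> filesToDelete) {
            this.partitionPath = partitionPath;
            this.filesToDelete = filesToDelete;
        }
    }

    // Current strategy: one request per partition, carrying every file.
    static List<RollbackRequest> perPartition(Map<String, List<String>> filesByPartition) {
        List<RollbackRequest> requests = new ArrayList<>();
        filesByPartition.forEach((partition, files) ->
            requests.add(new RollbackRequest(partition, new ArrayList<>(files))));
        return requests;
    }

    // Changed strategy: one request per file, so each deletion can be
    // scheduled as its own task.
    static List<RollbackRequest> perFile(Map<String, List<String>> filesByPartition) {
        List<RollbackRequest> requests = new ArrayList<>();
        filesByPartition.forEach((partition, files) ->
            files.forEach(file -> {
                List<String> single = new ArrayList<>();
                single.add(file);
                requests.add(new RollbackRequest(partition, single));
            }));
        return requests;
    }

    public static void main(String[] args) {
        // Made-up input: a single partition holding 1000 files.
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            files.add("file-" + i + ".parquet");
        }
        Map<String, List<String>> filesByPartition = new LinkedHashMap<>();
        filesByPartition.put("2023/05/12", files);

        System.out.println("per-partition requests: " + perPartition(filesByPartition).size());
        System.out.println("per-file requests: " + perFile(filesByPartition).size());
    }
}
```

With one request per partition, the number of parallel tasks is bounded by the 
number of partitions. With one request per file, it is bounded by the number 
of files, which is what lets the available executors share the deletions.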

 

We have several datasets where 90K+ files are inserted per commit. Rolling 
back a failed commit on these datasets takes hours; with this change it takes 
minutes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
