Prashant Wason created HUDI-6213:
------------------------------------
Summary: Parallelize deletion of files during rollback.
Key: HUDI-6213
URL: https://issues.apache.org/jira/browse/HUDI-6213
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assume we are rolling back a commit with a large number of files (1k+) in a
partition.
*Current strategy:*
For each partition, create a rollback request which contains the list of all
the files to be deleted from that partition. Since each rollback request is
executed on an executor, in this model a single executor deletes the 1k+
files sequentially. This is slow and does not take advantage of the rollback
parallelism or the presence of multiple executors.
*Changed strategy:*
Each rollback request should contain only a single file to be deleted from a
partition. Since each rollback request is executed on an executor, in this
model 1k+ tasks will be executed in parallel on the available executors. This
will speed up the deletion part of the rollback.
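The difference between the two strategies can be illustrated with a small, language-neutral sketch. It is not Hudi code: `RollbackRequest`, `requests_per_partition`, `requests_per_file`, and the partition and file names are all hypothetical, chosen only to show how the request list changes shape.

```python
from collections import namedtuple

# Hypothetical stand-in for a rollback request; NOT Hudi's actual class.
RollbackRequest = namedtuple("RollbackRequest", ["partition", "files"])


def requests_per_partition(files_by_partition):
    """Current strategy: one request per partition, listing all of its files."""
    return [
        RollbackRequest(partition, list(files))
        for partition, files in files_by_partition.items()
    ]


def requests_per_file(files_by_partition):
    """Changed strategy: one request per file to be deleted."""
    return [
        RollbackRequest(partition, [path])
        for partition, files in files_by_partition.items()
        for path in files
    ]


# Illustrative input: a single partition holding 1000 files (made-up names).
files_by_partition = {
    "2023/05/01": [f"file-{i}.parquet" for i in range(1000)],
}

per_partition = requests_per_partition(files_by_partition)
per_file = requests_per_file(files_by_partition)

# Each request becomes one task, so the request count bounds the parallelism.
print(len(per_partition), len(per_file))  # prints: 1 1000
```

With one request per partition, the single task must delete all 1000 files itself. With one request per file, the scheduler has 1000 independent tasks to spread across whatever executors are available.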
We have several datasets where the number of files inserted per commit is
90k+, so rolling back a failed commit takes hours. With this change it
takes minutes.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)