[ https://issues.apache.org/jira/browse/HUDI-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-6213:
---------------------------------
Labels: pull-request-available (was: )
> Parallelize deletion of files during rollback.
> ----------------------------------------------
>
> Key: HUDI-6213
> URL: https://issues.apache.org/jira/browse/HUDI-6213
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
>
> Assume we are rolling back a commit with a large number of files (1K+) in a
> partition.
> *Current strategy:*
> For each partition, create a rollback request which contains the list of all
> the files to be deleted from that partition. Since each rollback request is
> executed on an executor, in this model an executor would be deleting the 1K+
> files sequentially. This is slow and does not take advantage of the rollback
> parallelism or the presence of multiple executors.
> *Changed strategy:*
> Each rollback request should contain only a single file to be deleted from a
> partition. Since each rollback request is executed on an executor, in this
> model the 1K+ deletions become 1K+ tasks that run in parallel on the
> available executors. This will speed up the deletion part of the rollback.
>
> We have several datasets where the number of files inserted is 90K+ per
> commit. Rolling back a failed commit on these datasets takes hours. With this
> change it takes minutes.
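
The difference between the two strategies described above can be sketched roughly as follows. This is only an illustration, not Hudi's actual implementation: `RollbackRequest`, `per_partition_requests`, `per_file_requests`, `execute` and `run_rollback` are hypothetical names, and a local thread pool stands in for the engine's executors.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RollbackRequest:
    """Hypothetical stand-in for a rollback request: one partition plus files to delete."""
    partition: str
    files_to_delete: Tuple[str, ...]


def per_partition_requests(files_by_partition: Dict[str, List[str]]) -> List[RollbackRequest]:
    """Current strategy: one request per partition, carrying every file in it."""
    return [RollbackRequest(p, tuple(files)) for p, files in files_by_partition.items()]


def per_file_requests(files_by_partition: Dict[str, List[str]]) -> List[RollbackRequest]:
    """Changed strategy: one request per file, so each deletion is its own task."""
    return [
        RollbackRequest(p, (f,))
        for p, files in files_by_partition.items()
        for f in files
    ]


def execute(request: RollbackRequest) -> int:
    """Delete the files of a single request sequentially; return how many were removed."""
    deleted = 0
    for path in request.files_to_delete:
        try:
            os.remove(path)
            deleted += 1
        except FileNotFoundError:
            # Already gone (e.g. a retried rollback); treat as a no-op.
            pass
    return deleted


def run_rollback(requests: List[RollbackRequest], parallelism: int) -> int:
    """Run requests on a thread pool (assumed stand-in for executors); return total deletions."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return sum(pool.map(execute, requests))
```

With `per_partition_requests`, a single partition holding 1K+ files yields one request, so only one worker does any deleting regardless of `parallelism`. With `per_file_requests`, the same partition yields 1K+ single-file requests that the pool can spread across all workers.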
--
This message was sent by Atlassian Jira
(v8.20.10#820010)