zhangyue19921010 opened a new pull request #3920: URL: https://github.com/apache/hudi/pull/3920
https://issues.apache.org/jira/browse/HUDI-2683

## What is the purpose of the pull request

Currently, Hudi takes about 5 seconds to delete roughly 30 archived instant files, and longer when the archive threshold is larger (for example, `archive.max_commits` set to 100 or more). This is because Hudi deletes archived commits serially in the driver. That delay can be unacceptable for Spark Streaming jobs with second-level batch intervals. We need to delete archived commits in parallel.

Here are the logs from the current master branch:

```
21/11/01 02:39:27,453 INFO HoodieTimelineArchiveLog: Deleting archived instants [[==>20211101013335__commit__REQUESTED], [==>20211101013335__commit__INFLIGHT], [20211101013335__commit__COMPLETED], [==>20211101013541__commit__REQUESTED], [==>20211101013541__commit__INFLIGHT], [20211101013541__commit__COMPLETED], [==>20211101013807__commit__REQUESTED], [==>20211101013807__commit__INFLIGHT], [20211101013807__commit__COMPLETED], [==>20211101014003__commit__REQUESTED], [==>20211101014003__commit__INFLIGHT], [20211101014003__commit__COMPLETED], [==>20211101014152__commit__REQUESTED], [==>20211101014152__commit__INFLIGHT], [20211101014152__commit__COMPLETED], [==>20211101014347__commit__REQUESTED], [==>20211101014347__commit__INFLIGHT], [20211101014347__commit__COMPLETED], [==>20211101014546__commit__REQUESTED], [==>20211101014546__commit__INFLIGHT], [20211101014546__commit__COMPLETED], [==>20211101014756__commit__REQUESTED], [==>20211101014756__commit__INFLIGHT], [20211101014756__commit__COMPLETED], [==>20211101015008__commit__REQUESTED], [==>20211101015008__commit__INFLIGHT], [20211101015008__commit__COMPLETED], [==>20211101015217__commit__REQUESTED], [==>20211101015217__commit__INFLIGHT], [20211101015217__commit__COMPLETED], [==>20211101015449__commit__REQUESTED], [==>20211101015449__commit__INFLIGHT], [20211101015449__commit__COMPLETED]]
21/11/01 02:39:27,453 INFO HoodieTimelineArchiveLog: Deleting instants [[==>20211101013335__commit__REQUESTED], [==>20211101013335__commit__INFLIGHT], [20211101013335__commit__COMPLETED], [==>20211101013541__commit__REQUESTED], [==>20211101013541__commit__INFLIGHT], [20211101013541__commit__COMPLETED], [==>20211101013807__commit__REQUESTED], [==>20211101013807__commit__INFLIGHT], [20211101013807__commit__COMPLETED], [==>20211101014003__commit__REQUESTED], [==>20211101014003__commit__INFLIGHT], [20211101014003__commit__COMPLETED], [==>20211101014152__commit__REQUESTED], [==>20211101014152__commit__INFLIGHT], [20211101014152__commit__COMPLETED], [==>20211101014347__commit__REQUESTED], [==>20211101014347__commit__INFLIGHT], [20211101014347__commit__COMPLETED], [==>20211101014546__commit__REQUESTED], [==>20211101014546__commit__INFLIGHT], [20211101014546__commit__COMPLETED], [==>20211101014756__commit__REQUESTED], [==>20211101014756__commit__INFLIGHT], [20211101014756__commit__COMPLETED], [==>20211101015008__commit__REQUESTED], [==>20211101015008__commit__INFLIGHT], [20211101015008__commit__COMPLETED], [==>20211101015217__commit__REQUESTED], [==>20211101015217__commit__INFLIGHT], [20211101015217__commit__COMPLETED], [==>20211101015449__commit__REQUESTED], [==>20211101015449__commit__INFLIGHT], [20211101015449__commit__COMPLETED]]
21/11/01 02:39:27,578 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.commit.requested
21/11/01 02:39:27,710 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.inflight
21/11/01 02:39:27,846 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.commit
21/11/01 02:39:27,989 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.commit.requested
21/11/01 02:39:28,117 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.inflight
21/11/01 02:39:28,249 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.commit
21/11/01 02:39:28,428 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.commit.requested
21/11/01 02:39:28,605 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.inflight
21/11/01 02:39:28,742 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.commit
21/11/01 02:39:28,866 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.commit.requested
21/11/01 02:39:28,997 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.inflight
21/11/01 02:39:29,139 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.commit
21/11/01 02:39:29,267 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.commit.requested
21/11/01 02:39:29,397 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.inflight
21/11/01 02:39:29,519 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.commit
21/11/01 02:39:29,646 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.commit.requested
21/11/01 02:39:29,789 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.inflight
21/11/01 02:39:29,917 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.commit
21/11/01 02:39:30,041 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.commit.requested
21/11/01 02:39:30,170 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.inflight
21/11/01 02:39:30,308 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.commit
21/11/01 02:39:30,442 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.commit.requested
21/11/01 02:39:30,586 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.inflight
21/11/01 02:39:30,751 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.commit
21/11/01 02:39:30,883 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.commit.requested
21/11/01 02:39:31,356 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.inflight
21/11/01 02:39:31,727 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.commit
21/11/01 02:39:31,932 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.commit.requested
21/11/01 02:39:32,065 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.inflight
21/11/01 02:39:32,266 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.commit
21/11/01 02:39:32,401 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.commit.requested
21/11/01 02:39:32,533 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.inflight
21/11/01 02:39:32,888 INFO HoodieTimelineArchiveLog: Archived and deleted instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.commit
21/11/01 02:39:32,888 INFO HoodieTimelineArchiveLog: Latest Committed Instant=Option{val=[20211101015449__commit__COMPLETED]}
```

As the log shows, deleting these 33 archived instant files (11 commits, each with a requested, an inflight, and a completed file) took about 5.4 seconds, from 02:39:27,453 to 02:39:32,888.

After this patch:

<img width="1671" alt="Screenshot 2021-11-02 5.05.00 PM" src="https://user-images.githubusercontent.com/69956021/140247280-b4c7c0d8-7d97-44b2-9e97-9c00c329a15d.png">
<img width="1677" alt="Screenshot 2021-11-02 5.05.09 PM" src="https://user-images.githubusercontent.com/69956021/140247321-b5aba1a8-1658-4ba4-b7c6-6611f504844d.png">

The same deletion takes only about 0.5 s.

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
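
To illustrate the idea of deleting archived instant files in parallel rather than one by one, here is a minimal sketch. It is not the code from this pull request: `ParallelInstantDeleter` and `deleteAll` are hypothetical names, and the sketch assumes a local file system accessed through `java.nio.file` and a plain thread pool, whereas Hudi works against Hadoop-compatible storage and may distribute the work differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper for illustration only; not part of Hudi. */
public class ParallelInstantDeleter {

  /**
   * Deletes the given files concurrently on a fixed-size thread pool.
   *
   * @param files       instant files to delete (assumed to be local paths here)
   * @param parallelism number of worker threads
   * @return how many files existed and were actually removed
   */
  public static int deleteAll(List<Path> files, int parallelism) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      // Submit one delete task per file so the per-file latency overlaps.
      List<Future<Boolean>> results = new ArrayList<>();
      for (Path file : files) {
        Callable<Boolean> task = () -> Files.deleteIfExists(file);
        results.add(pool.submit(task));
      }
      // Wait for every task and count the successful deletions.
      int deleted = 0;
      for (Future<Boolean> result : results) {
        try {
          if (result.get()) {
            deleted++;
          }
        } catch (ExecutionException e) {
          throw new IllegalStateException("Failed to delete an instant file", e.getCause());
        }
      }
      return deleted;
    } finally {
      pool.shutdown();
    }
  }
}
```

With serial deletion, total time is roughly the sum of the per-file round trips (about 130 ms to 470 ms each in the log above). Running the deletes concurrently lets those round trips overlap, so total time approaches that of the slowest batch instead.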
