zhangyue19921010 opened a new pull request #3920:
URL: https://github.com/apache/hudi/pull/3920


   https://issues.apache.org/jira/browse/HUDI-2683
   ## What is the purpose of the pull request
   Currently, Hudi takes about 5 seconds to delete roughly 30 archived instant 
files, and it gets worse with a larger archive threshold, for example 
archive.max_commits set to 100 or more.
   
   This is because Hudi deletes archived commits serially on the driver.
   
   This delay can be unacceptable for Spark Streaming jobs with second-level 
batch intervals.
   
   We need to delete archived commits in parallel.
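
As a rough illustration of the idea (not the code in this pull request), the sketch below submits each delete as an independent task to a thread pool so that the per-file round trips overlap. The class `ParallelInstantDelete` and its `deleteAll` method are hypothetical names, and the sketch uses `java.nio.file` on a local directory as a stand-in for the table's actual storage.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper for illustration only; not part of Hudi. */
public class ParallelInstantDelete {

    /**
     * Deletes every file in instantFiles using a fixed pool of
     * `parallelism` threads and returns how many files were removed.
     */
    public static int deleteAll(List<Path> instantFiles, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path file : instantFiles) {
                // Each delete is its own task, so slow calls no longer queue up
                Callable<Boolean> task = () -> Files.deleteIfExists(file);
                results.add(pool.submit(task));
            }
            int deleted = 0;
            for (Future<Boolean> result : results) {
                // get() rethrows any delete failure as an ExecutionException
                if (result.get()) {
                    deleted++;
                }
            }
            return deleted;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a temp directory stands in for the .hoodie folder
        Path dir = Files.createTempDirectory("hoodie-demo");
        List<Path> files = new ArrayList<>();
        String[] names = {
            "20211101013335.commit.requested",
            "20211101013335.inflight",
            "20211101013335.commit"
        };
        for (String name : names) {
            files.add(Files.createFile(dir.resolve(name)));
        }
        System.out.println("deleted " + deleteAll(files, 3) + " files");
        Files.delete(dir);
    }
}
```

With serial deletion the total time is the sum of the individual delete latencies; with a pool it approaches the slowest single delete times the number of batches, which matters most on object stores where every call has a fixed network cost.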
   
   
   
   Here are the logs from the current master branch:
   ```
   21/11/01 02:39:27,453 INFO HoodieTimelineArchiveLog: Deleting archived 
instants [[==>20211101013335__commit__REQUESTED], 
[==>20211101013335__commit__INFLIGHT], [20211101013335__commit__COMPLETED], 
[==>20211101013541__commit__REQUESTED], [==>20211101013541__commit__INFLIGHT], 
[20211101013541__commit__COMPLETED], [==>20211101013807__commit__REQUESTED], 
[==>20211101013807__commit__INFLIGHT], [20211101013807__commit__COMPLETED], 
[==>20211101014003__commit__REQUESTED], [==>20211101014003__commit__INFLIGHT], 
[20211101014003__commit__COMPLETED], [==>20211101014152__commit__REQUESTED], 
[==>20211101014152__commit__INFLIGHT], [20211101014152__commit__COMPLETED], 
[==>20211101014347__commit__REQUESTED], [==>20211101014347__commit__INFLIGHT], 
[20211101014347__commit__COMPLETED], [==>20211101014546__commit__REQUESTED], 
[==>20211101014546__commit__INFLIGHT], [20211101014546__commit__COMPLETED], 
[==>20211101014756__commit__REQUESTED], [==>20211101014756__commit__INFLIGHT], 
[20211101014756__commit__COMPLETED], [==>20211101015008__commit__REQUESTED], 
[==>20211101015008__commit__INFLIGHT], [20211101015008__commit__COMPLETED], 
[==>20211101015217__commit__REQUESTED], [==>20211101015217__commit__INFLIGHT], 
[20211101015217__commit__COMPLETED], [==>20211101015449__commit__REQUESTED], 
[==>20211101015449__commit__INFLIGHT], [20211101015449__commit__COMPLETED]]
   21/11/01 02:39:27,453 INFO HoodieTimelineArchiveLog: Deleting instants 
[[==>20211101013335__commit__REQUESTED], [==>20211101013335__commit__INFLIGHT], 
[20211101013335__commit__COMPLETED], [==>20211101013541__commit__REQUESTED], 
[==>20211101013541__commit__INFLIGHT], [20211101013541__commit__COMPLETED], 
[==>20211101013807__commit__REQUESTED], [==>20211101013807__commit__INFLIGHT], 
[20211101013807__commit__COMPLETED], [==>20211101014003__commit__REQUESTED], 
[==>20211101014003__commit__INFLIGHT], [20211101014003__commit__COMPLETED], 
[==>20211101014152__commit__REQUESTED], [==>20211101014152__commit__INFLIGHT], 
[20211101014152__commit__COMPLETED], [==>20211101014347__commit__REQUESTED], 
[==>20211101014347__commit__INFLIGHT], [20211101014347__commit__COMPLETED], 
[==>20211101014546__commit__REQUESTED], [==>20211101014546__commit__INFLIGHT], 
[20211101014546__commit__COMPLETED], [==>20211101014756__commit__REQUESTED], 
[==>20211101014756__commit__INFLIGHT], [20211101014756__commit__COMPLETED], 
[==>20211101015008__commit__REQUESTED], 
[==>20211101015008__commit__INFLIGHT], [20211101015008__commit__COMPLETED], 
[==>20211101015217__commit__REQUESTED], [==>20211101015217__commit__INFLIGHT], 
[20211101015217__commit__COMPLETED], [==>20211101015449__commit__REQUESTED], 
[==>20211101015449__commit__INFLIGHT], [20211101015449__commit__COMPLETED]]
   21/11/01 02:39:27,578 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.commit.requested
   21/11/01 02:39:27,710 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.inflight
   21/11/01 02:39:27,846 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013335.commit
   21/11/01 02:39:27,989 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.commit.requested
   21/11/01 02:39:28,117 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.inflight
   21/11/01 02:39:28,249 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013541.commit
   21/11/01 02:39:28,428 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.commit.requested
   21/11/01 02:39:28,605 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.inflight
   21/11/01 02:39:28,742 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101013807.commit
   21/11/01 02:39:28,866 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.commit.requested
   21/11/01 02:39:28,997 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.inflight
   21/11/01 02:39:29,139 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014003.commit
   21/11/01 02:39:29,267 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.commit.requested
   21/11/01 02:39:29,397 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.inflight
   21/11/01 02:39:29,519 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014152.commit
   21/11/01 02:39:29,646 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.commit.requested
   21/11/01 02:39:29,789 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.inflight
   21/11/01 02:39:29,917 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014347.commit
   21/11/01 02:39:30,041 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.commit.requested
   21/11/01 02:39:30,170 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.inflight
   21/11/01 02:39:30,308 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014546.commit
   21/11/01 02:39:30,442 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.commit.requested
   21/11/01 02:39:30,586 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.inflight
   21/11/01 02:39:30,751 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101014756.commit
   21/11/01 02:39:30,883 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.commit.requested
   21/11/01 02:39:31,356 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.inflight
   21/11/01 02:39:31,727 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015008.commit
   21/11/01 02:39:31,932 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.commit.requested
   21/11/01 02:39:32,065 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.inflight
   21/11/01 02:39:32,266 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015217.commit
   21/11/01 02:39:32,401 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file 
s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.commit.requested
   21/11/01 02:39:32,533 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.inflight
   21/11/01 02:39:32,888 INFO HoodieTimelineArchiveLog: Archived and deleted 
instant file s3a://xx-xx-xxx/xx/xx/xxxxxx/.hoodie/20211101015449.commit
   21/11/01 02:39:32,888 INFO HoodieTimelineArchiveLog: Latest Committed 
Instant=Option{val=[20211101015449__commit__COMPLETED]}
   ```
   
   As the timestamps show, Hudi took about 5.4 seconds (02:39:27,453 to 
02:39:32,888) to delete the 33 instant files that belong to the 11 archived 
commits.
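
The per-file cost can be read off the log directly. The following minimal sketch parses the first and last timestamps shown above and divides by the 33 "Archived and deleted" lines; the class name `ArchiveDeleteTiming` is made up for this example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for illustration only. */
public class ArchiveDeleteTiming {

    // Matches the timestamp prefix of the log lines, e.g. "21/11/01 02:39:27,453"
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss,SSS");

    /** Returns the milliseconds elapsed between two log timestamps. */
    public static long elapsedMillis(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // First "Deleting archived instants" line and last "Archived and deleted" line
        long total = elapsedMillis("21/11/01 02:39:27,453", "21/11/01 02:39:32,888");
        int files = 33; // 11 commits x 3 instant files each
        System.out.println(total + " ms total, about " + (total / files) + " ms per file");
    }
}
```

That works out to 5435 ms in total, or roughly 164 ms per delete call, so the cost grows linearly with the number of instant files when the calls run one after another.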
   
   After this patch:
   
   
   <img width="1671" alt="Screenshot 2021-11-02 5 05 00 PM" 
src="https://user-images.githubusercontent.com/69956021/140247280-b4c7c0d8-7d97-44b2-9e97-9c00c329a15d.png">
   
   <img width="1677" alt="Screenshot 2021-11-02 5 05 09 PM" 
src="https://user-images.githubusercontent.com/69956021/140247321-b5aba1a8-1658-4ba4-b7c6-6611f504844d.png">
   
   With the patch, the same deletion takes only about 0.5 seconds.
   
   
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
