[ https://issues.apache.org/jira/browse/RATIS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987857#comment-17987857 ]

Tian Jiang commented on RATIS-2314:
-----------------------------------

[~szetszwo] [[RATIS-2314] Fix that SegmentedRaftLogWorker may append entry by itself by jt2594838 · Pull Request #1274 · apache/ratis|https://github.com/apache/ratis/pull/1274] PTAL

> Deadlock when appending large batches of entries
> ------------------------------------------------
>
>                 Key: RATIS-2314
>                 URL: https://issues.apache.org/jira/browse/RATIS-2314
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Tian Jiang
>            Priority: Major
>         Attachments: image-2025-06-30-13-25-54-574.png,
> image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png,
> image-2025-06-30-13-37-28-743.png, image-2025-07-01-10-43-50-136.png,
> image-2025-07-01-10-44-53-302.png, image-2025-07-01-10-46-04-453.png,
> image-2025-07-01-10-47-16-439.png, image-2025-07-01-10-47-20-566.png,
> image-2025-07-01-10-47-54-166.png, image-2025-07-01-10-48-56-806.png,
> image-2025-07-01-10-52-05-896.png, image-2025-07-01-11-05-10-982.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Greetings, I found a deadlock while using Ratis, which is shown in the
> following stack:
> !image-2025-06-30-13-25-54-574.png!
> The problem is that Ratis uses CompletableFuture to serialize the process of
> appending logs. The timeline is as follows:
> # Thread "SegmentedRaftLogWorker" is working on some logs;
> # Thread "SomeClient" tries to append 5000 entries to the log;
> # Because "SegmentedRaftLogWorker" is working, "SomeClient" adds the entries
> to the CompletableFuture; !image-2025-06-30-13-33-13-766.png!
> # When "SegmentedRaftLogWorker" is done with its last task, it checks the
> CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
> # There is one from "SomeClient", ordering it to append those 5000
> entries; therefore "SegmentedRaftLogWorker" itself continues to append the
> entries, and it reaches: !image-2025-06-30-13-37-28-743.png!
> # However, the queue of SegmentedRaftLogWorker has a max capacity of 4096,
> and there are 5000 entries to be appended, so thread "SegmentedRaftLogWorker"
> waits for the queue to be not full;
> # But who is supposed to consume from the queue? Thread
> "SegmentedRaftLogWorker" itself! As a result, the wait never ends.
> I notice that Ratis uses many thread-stealing tricks to improve thread
> locality, which is appreciated. However, in this scenario the technique
> creates a deadlock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
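
The timeline in the quoted description can be reproduced in miniature with a sketch like the one below. It is not Ratis code: the class and method names (SelfAppendDeadlockSketch, appendStalls) are hypothetical, a single-thread executor stands in for the "SegmentedRaftLogWorker" thread, and a timed offer() replaces the indefinite wait so the program terminates and reports the stall instead of hanging. The queue capacity (4096) and batch size (5000) are the numbers from the report.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SelfAppendDeadlockSketch {

  /**
   * Returns true if the append task, executed by the queue's only consumer
   * thread, could not enqueue all entries (the real code would wait forever).
   */
  public static boolean appendStalls(int capacity, int entries) throws Exception {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
    // Hypothetical stand-in for the single worker thread.
    ExecutorService worker = Executors.newSingleThreadExecutor();
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Steps 1-2: the worker is busy with an earlier task.
      CompletableFuture<Void> previous = CompletableFuture.runAsync(() -> {
        try {
          release.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }, worker);

      // Step 3: the client chains its append onto the unfinished future.
      // A non-async dependent runs on the thread that completes "previous",
      // which here is the worker thread itself (steps 4-5).
      CompletableFuture<Boolean> append = previous.thenApply(v -> {
        try {
          for (int i = 0; i < entries; i++) {
            // Step 6: timed offer instead of a blocking put, so the sketch ends.
            if (!queue.offer(i, 200, TimeUnit.MILLISECONDS)) {
              return true; // queue full and nobody is draining it
            }
          }
          return false;
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return true;
        }
      });

      // Step 7: the only consumer is another task on the same worker thread,
      // so it cannot run while the append above occupies that thread.
      worker.execute(queue::clear);

      release.countDown(); // let the earlier task finish
      return append.get(5, TimeUnit.SECONDS);
    } finally {
      worker.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("capacity 4096, 5000 entries stalls: " + appendStalls(4096, 5000));
    System.out.println("capacity 8192, 5000 entries stalls: " + appendStalls(8192, 5000));
  }
}
```

With a batch larger than the queue capacity the append stalls, because the thread that would drain the queue is the one trying to fill it. With a capacity at least as large as the batch it completes. Raising the capacity only moves the threshold; the description points at the more direct remedy of never letting the worker thread run the append itself.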