[ 
https://issues.apache.org/jira/browse/RATIS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tian Jiang updated RATIS-2314:
------------------------------
    Attachment: image-2025-07-01-10-47-16-439.png

> Deadlock when appending large batches of entries
> ------------------------------------------------
>
>                 Key: RATIS-2314
>                 URL: https://issues.apache.org/jira/browse/RATIS-2314
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Tian Jiang
>            Priority: Major
>         Attachments: image-2025-06-30-13-25-54-574.png, 
> image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png, 
> image-2025-06-30-13-37-28-743.png, image-2025-07-01-10-43-50-136.png, 
> image-2025-07-01-10-44-53-302.png, image-2025-07-01-10-46-04-453.png, 
> image-2025-07-01-10-47-16-439.png, image-2025-07-01-10-47-20-566.png, 
> image-2025-07-01-10-47-54-166.png
>
>
> Greetings. I found a deadlock while using Ratis, as shown in the following 
> stack trace:
> !image-2025-06-30-13-25-54-574.png!
> The problem is that Ratis uses CompletableFuture to serialize the appending 
> of log entries. The timeline is as follows:
>  # Thread "SegmentedRaftLogWorker" is working on some logs;
>  # Thread "SomeClient" tries to append 5000 entries to the log;
>  # Because "SegmentedRaftLogWorker" is working, "SomeClient" adds the entries 
> to the CompletableFuture; !image-2025-06-30-13-33-13-766.png!
>  # When "SegmentedRaftLogWorker" is done with its last task, it checks the 
> CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
>  # There is one from "SomeClient", asking it to append the 5000 entries; 
> therefore "SegmentedRaftLogWorker" goes on to append the entries itself, 
> and it comes to: !image-2025-06-30-13-37-28-743.png!
>  # However, the queue of SegmentedRaftLogWorker has a max capacity of 4096, 
> and there are 5000 entries to be appended, so thread "SegmentedRaftLogWorker" 
> blocks, waiting for the queue to have free space;
>  # But who is supposed to consume from the queue? Thread 
> "SegmentedRaftLogWorker" itself! As a result, the wait never ends.
> I notice that Ratis uses many thread-stealing tricks to improve thread 
> locality, which is appreciated. However, in this scenario the technique 
> seems to create a deadlock.
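
The timeline above can be reproduced in miniature with plain JDK classes. The following is only a sketch under stated assumptions, not Ratis code: the class name SelfEnqueueSketch, the method enqueueOnWorkerThread, and the thread name are hypothetical, queue.clear() merely stands in for the worker consuming entries, and a timed offer() replaces the blocking wait so that the sketch terminates instead of hanging. It relies on the JDK behavior that a non-async callback registered on an incomplete CompletableFuture runs on the thread that completes it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names are Ratis APIs.
class SelfEnqueueSketch {

  /** Returns how many entries were enqueued before the producer stalled. */
  static int enqueueOnWorkerThread(int capacity, int batch) throws Exception {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
    CompletableFuture<Void> previousTask = new CompletableFuture<>();

    // "SomeClient": the worker is busy, so the append is chained onto its
    // in-flight task instead of being run right away.
    CompletableFuture<Integer> append = previousTask.thenApply(ignored -> {
      int enqueued = 0;
      try {
        for (int i = 0; i < batch; i++) {
          // A plain put() would block forever once the queue is full; the
          // timed offer() is used only so that this sketch terminates.
          if (!queue.offer(i, 100, TimeUnit.MILLISECONDS)) {
            break;
          }
          enqueued++;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return enqueued;
    });

    // The worker: completing its task runs the chained callback on this very
    // thread, so it cannot consume from the queue until the callback returns.
    Thread worker = new Thread(() -> {
      previousTask.complete(null);
      queue.clear(); // stands in for consuming entries; reached too late
    }, "worker-sketch");
    worker.start();
    worker.join();
    return append.get();
  }

  public static void main(String[] args) throws Exception {
    // Numbers from the report: queue capacity 4096, batch of 5000 entries.
    System.out.println(enqueueOnWorkerThread(4096, 5000)); // prints 4096
  }
}
```

With the numbers from the report, the callback fills all 4096 slots and then stalls on entry 4097, because the only consumer is the thread that is currently blocked as the producer. A batch that fits within the capacity goes through without stalling.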



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
