[
https://issues.apache.org/jira/browse/RATIS-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059960#comment-18059960
]
Tsz-wo Sze commented on RATIS-2415:
-----------------------------------
{quote}
2. Send request2 → writeAndFlush throws exception, queue=[future1,future2],
network=[request1]
3. Send request3 → success, queue=[future1,future2,future3],
network=[request1,request3]
{quote}
[~slfan1989], good catch on the bug!
- client.writeAndFlush only throws AlreadyClosedException. If it does throw
AlreadyClosedException, request3 cannot be sent successfully. So, this case
seems fine.
- However, if channel.writeAndFlush fails, the current code just ignores
the ChannelFuture failure. It should complete the reply exceptionally and
close the client.
Will comment on the PR.
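
A rough sketch of the second point, using only java.util.concurrent so that it runs without Netty. This is not the Ratis code: WriteFailureSketch, send, close, isClosed and pending are hypothetical names, and the writeResult parameter stands in for the ChannelFuture returned by channel.writeAndFlush.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for NettyRpcProxy.Connection; not the Ratis implementation.
class WriteFailureSketch {
  private final Queue<CompletableFuture<String>> replies = new ArrayDeque<>();
  private boolean closed = false;

  // writeResult plays the role of the ChannelFuture from channel.writeAndFlush.
  synchronized CompletableFuture<String> send(CompletableFuture<Void> writeResult) {
    final CompletableFuture<String> reply = new CompletableFuture<>();
    if (closed) {
      // Once closed, nothing new is enqueued.
      reply.completeExceptionally(new IOException("connection closed"));
      return reply;
    }
    replies.offer(reply);
    writeResult.whenComplete((ignored, cause) -> {
      if (cause != null) {
        // Do not ignore the write failure: fail this reply and close.
        reply.completeExceptionally(cause);
        close(cause);
      }
    });
    return reply;
  }

  // Fails every pending reply so that no caller waits forever.
  synchronized void close(Throwable cause) {
    closed = true;
    CompletableFuture<String> pendingReply;
    while ((pendingReply = replies.poll()) != null) {
      pendingReply.completeExceptionally(cause);
    }
  }

  synchronized boolean isClosed() {
    return closed;
  }

  synchronized int pending() {
    return replies.size();
  }
}
```

In this sketch, a failed write also fails the replies queued before it, since the connection is closed and their responses can no longer be matched by order. Whether the real fix should do the same is a design question for the PR.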
> Fix queue corruption in NettyRpcProxy when request sending fails
> -----------------------------------------------------------------
>
> Key: RATIS-2415
> URL: https://issues.apache.org/jira/browse/RATIS-2415
> Project: Ratis
> Issue Type: Bug
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> *Summary*
> NettyRpcProxy.Connection.offer() has a bug where a CompletableFuture is
> added to the replies queue before calling writeAndFlush(). If writeAndFlush()
> throws an AlreadyClosedException (or fails asynchronously), the future
> remains in the queue, causing memory leaks and reply mismatches.
>
> *Root Cause*
> {code:java}
> synchronized ChannelFuture offer(...) {
>   replies.offer(reply);                 // Step 1: enqueue
>   return client.writeAndFlush(request); // Step 2: may throw exception
> }
> {code}
> If Step 2 fails, Step 1 is not rolled back, leaving the queue corrupted.
> *Reproduction Scenario*
> 1. Send request1 → success, queue=[future1], network=[request1]
> 2. Send request2 → writeAndFlush throws exception, queue=[future1,future2],
> network=[request1]
> 3. Send request3 → success, queue=[future1,future2,future3],
> network=[request1,request3]
> 4. Server returns response1, response3
> 5. Client receives response1 → pollReply() gets future1 ✅
> 6. Client receives response3 → pollReply() gets future2 ❌ (mismatch!)
> 7. future3 never completes (timeout)
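
The seven steps above can be modeled with a plain FIFO queue. This is only an illustration of the order-based matching, not the Ratis code: QueueMismatchDemo, onResponse and run are hypothetical names, and the write failure is simulated with a boolean flag.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the FIFO reply matching; not the Ratis implementation.
class QueueMismatchDemo {
  private final Queue<CompletableFuture<String>> replies = new ArrayDeque<>();

  // Enqueue first, then "write"; a failed write is not rolled back.
  void offer(CompletableFuture<String> reply, boolean writeSucceeds) {
    replies.offer(reply);
    if (!writeSucceeds) {
      throw new IllegalStateException("simulated write failure");
    }
  }

  // Responses are matched to futures purely by arrival order.
  void onResponse(String response) {
    replies.poll().complete(response);
  }

  static String run() {
    final QueueMismatchDemo connection = new QueueMismatchDemo();
    final CompletableFuture<String> future1 = new CompletableFuture<>();
    final CompletableFuture<String> future2 = new CompletableFuture<>();
    final CompletableFuture<String> future3 = new CompletableFuture<>();

    connection.offer(future1, true);      // step 1
    try {
      connection.offer(future2, false);   // step 2: write fails
    } catch (IllegalStateException e) {
      // The caller sees the failure, but future2 stays in the queue.
    }
    connection.offer(future3, true);      // step 3

    connection.onResponse("response1");   // step 5: completes future1
    connection.onResponse("response3");   // step 6: completes future2

    // Step 7: future3 is still pending.
    return future1.getNow("pending") + "," + future2.getNow("pending")
        + "," + future3.getNow("pending");
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints "response1,response3,pending"
  }
}
```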
--
This message was sent by Atlassian Jira
(v8.20.10#820010)