[ 
https://issues.apache.org/jira/browse/RATIS-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059960#comment-18059960
 ] 

Tsz-wo Sze commented on RATIS-2415:
-----------------------------------

{quote}
2. Send request2 → writeAndFlush throws exception, queue=[future1,future2], 
network=[request1]
3. Send request3 → success, queue=[future1,future2,future3], 
network=[request1,request3]
{quote}

[~slfan1989], good catch on the bug!

- client.writeAndFlush only throws AlreadyClosedException.  If it does throw 
AlreadyClosedException, request3 cannot be sent successfully either.  So, this 
case seems fine.
- However, if channel.writeAndFlush fails, the current code just ignores the 
ChannelFuture failure.  It should complete the reply exceptionally and close 
the client.
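
A minimal sketch of the second point, not the actual patch.  It uses only JDK 
classes so that it runs standalone: a CompletableFuture<Void> stands in for 
Netty's ChannelFuture, and ConnectionSketch, offer, close, isClosed and 
pending are hypothetical names, not the real NettyRpcProxy API.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Sketch only: CompletableFuture<Void> is a stand-in for Netty's ChannelFuture. */
class ConnectionSketch {
  private final Queue<CompletableFuture<String>> replies = new ArrayDeque<>();
  private boolean closed = false;

  /** Enqueue the reply, then react to the outcome of the asynchronous write. */
  synchronized CompletableFuture<String> offer(CompletableFuture<Void> writeFuture) {
    final CompletableFuture<String> reply = new CompletableFuture<>();
    if (closed) {
      // Assumption: a closed connection rejects new requests immediately.
      reply.completeExceptionally(new IllegalStateException("connection closed"));
      return reply;
    }
    replies.offer(reply);
    writeFuture.whenComplete((ignored, cause) -> {
      if (cause != null) {
        // Do not ignore the write failure: fail this reply and close.
        reply.completeExceptionally(cause);
        close(cause);
      }
    });
    return reply;
  }

  /** Fail every pending reply so no later response is matched to a stale future. */
  synchronized void close(Throwable cause) {
    closed = true;
    for (CompletableFuture<String> f; (f = replies.poll()) != null; ) {
      f.completeExceptionally(cause); // no-op if f has already completed
    }
  }

  synchronized boolean isClosed() {
    return closed;
  }

  synchronized int pending() {
    return replies.size();
  }

  public static void main(String[] args) {
    final ConnectionSketch c = new ConnectionSketch();
    final CompletableFuture<String> r1 = c.offer(CompletableFuture.<Void>completedFuture(null));

    final CompletableFuture<Void> failedWrite = new CompletableFuture<>();
    failedWrite.completeExceptionally(new java.io.IOException("write failed"));
    final CompletableFuture<String> r2 = c.offer(failedWrite);

    System.out.println("r1 failed: " + r1.isCompletedExceptionally());
    System.out.println("r2 failed: " + r2.isCompletedExceptionally());
    System.out.println("closed: " + c.isClosed() + ", pending: " + c.pending());
  }
}
```

Closing on the first write failure empties the queue, so the FIFO matching of 
replies to responses cannot drift; the cost is that requests already in 
flight on that connection fail as well.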

Will comment on the PR.

>  Fix queue corruption in NettyRpcProxy when request sending fails
> -----------------------------------------------------------------
>
>                 Key: RATIS-2415
>                 URL: https://issues.apache.org/jira/browse/RATIS-2415
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Summary*
> NettyRpcProxy.Connection.offer() has a bug where a CompletableFuture is 
> added to the replies queue before calling writeAndFlush(). If writeAndFlush() 
> throws an AlreadyClosedException (or fails asynchronously), the future 
> remains in the queue, causing memory leaks and reply mismatches.
>  
> *Root Cause*
> {code:java}
> synchronized ChannelFuture offer(...) {
>     replies.offer(reply); // Step 1: enqueue
>     return client.writeAndFlush(request); // Step 2: may throw exception
> } {code}
> If Step 2 fails, Step 1 is not rolled back, leaving the queue corrupted.
> *Reproduction Scenario*
> 1. Send request1 → success, queue=[future1], network=[request1]
> 2. Send request2 → writeAndFlush throws exception, queue=[future1,future2], 
> network=[request1]
> 3. Send request3 → success, queue=[future1,future2,future3], 
> network=[request1,request3]
> 4. Server returns response1, response3
> 5. Client receives response1 → pollReply() gets future1 ✅
> 6. Client receives response3 → pollReply() gets future2 ❌ (mismatch!)
> 7. future3 never completes (timeout)
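
The FIFO mismatch in steps 1-7 can be replayed with a small standalone 
simulation.  This is only a sketch under the scenario's own assumption that 
request2 fails to send while request3 still succeeds; ReplyMismatchSketch and 
replay are hypothetical names, and plain strings stand in for the real 
request and response types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

class ReplyMismatchSketch {
  /** Replays the scenario and returns the three futures in request order. */
  static List<CompletableFuture<String>> replay() {
    final Queue<CompletableFuture<String>> replies = new ArrayDeque<>();
    final List<CompletableFuture<String>> futures = new ArrayList<>();
    final List<String> network = new ArrayList<>();
    final boolean[] sendSucceeds = {true, false, true}; // request2 fails

    for (int i = 0; i < sendSucceeds.length; i++) {
      final CompletableFuture<String> f = new CompletableFuture<>();
      futures.add(f);
      replies.offer(f);                     // Step 1: always enqueued
      if (sendSucceeds[i]) {
        network.add("response" + (i + 1));  // Step 2: only sent on success
      }
    }

    // The server answers only what it received; the client matches
    // each response to the head of the queue.
    for (String response : network) {
      replies.poll().complete(response);
    }
    return futures;
  }

  public static void main(String[] args) {
    final List<CompletableFuture<String>> futures = replay();
    for (int i = 0; i < futures.size(); i++) {
      System.out.println("future" + (i + 1) + " -> " + futures.get(i).getNow("<never completed>"));
    }
    // future1 -> response1
    // future2 -> response3
    // future3 -> <never completed>
  }
}
```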



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
