He-Pin opened a new issue, #3126:
URL: https://github.com/apache/pekko/issues/3126

   ### Motivation
   
   In the classic remoting transport (deprecated but still shipped), 
`AckedSendBuffer.acknowledge` throws an `IllegalArgumentException` when it 
receives an ACK with a `cumulativeAck` higher than `maxSeq`. Under normal 
operation this would indicate a protocol invariant violation. However, a 
**stale ACK from a previous association** can legitimately trigger this 
condition after a transient network disruption, leading to an irrecoverable 
quarantine.
   
   ### Scenario
   
   1. Node A sends system messages (seq 0, 1, 2) to Node B. `maxSeq = 2`.
   2. Node B receives all messages and sends `ACK(2)`, but the ACK is delayed 
in the network.
   3. A transient network disruption causes the connection to drop.
   4. Node B restarts with a new UID.
   5. A new association is established. Node A receives the new UID and calls 
`reset()`, resetting `maxSeq` to `-1`.
   6. The delayed `ACK(2)` from the old session arrives.
   7. `ack.cumulativeAck (2) > maxSeq (-1)` triggers the 
`IllegalArgumentException`.
   8. The exception is caught and wrapped as `HopelessAssociation`, which 
triggers quarantine.
   9. Node B is quarantined for days (configurable via `quarantine-duration`), 
and recovery requires manual intervention or a full cluster restart.
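
   The steps above can be replayed against a small model. This is only a 
sketch: `SendBuffer` and `replay` are hypothetical names, sequence numbers are 
plain `Long`s, and only the cumulative ACK is modelled, so it is not the actual 
`AckedSendBuffer` code. It assumes the strict check behaves as described in 
this issue.

```scala
import scala.util.Try

object StaleAckRepro {
  // Hypothetical, simplified stand-in for the classic-remoting send buffer.
  final case class SendBuffer(nonAcked: Vector[Long] = Vector.empty, maxSeq: Long = -1L) {
    def buffer(seq: Long): SendBuffer =
      copy(nonAcked = nonAcked :+ seq, maxSeq = seq)

    // Strict check as described in the issue: an ACK beyond maxSeq throws.
    def acknowledge(cumulativeAck: Long): SendBuffer = {
      if (cumulativeAck > maxSeq)
        throw new IllegalArgumentException(
          s"Highest SEQ so far was $maxSeq but cumulative ACK is $cumulativeAck")
      copy(nonAcked = nonAcked.filter(_ > cumulativeAck))
    }
  }

  def replay(): Try[SendBuffer] = {
    // Step 1: Node A sends seq 0, 1, 2, so maxSeq becomes 2.
    val oldAssociation =
      (0L to 2L).foldLeft(SendBuffer())((buf, seq) => buf.buffer(seq))
    require(oldAssociation.maxSeq == 2L)

    // Step 5: the new UID causes a reset, modelled here as a fresh buffer.
    val afterReset = SendBuffer()

    // Steps 6-7: the delayed ACK(2) from the old session hits the reset buffer.
    Try(afterReset.acknowledge(2L))
  }
}
```

   In this model `StaleAckRepro.replay()` yields a `Failure` wrapping an 
`IllegalArgumentException`, which corresponds to step 7.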
   
   ### Impact
   
   - A transient network disruption followed by a remote node restart can cause 
an **irrecoverable quarantine** lasting days.
   - This affects only **classic remoting** (deprecated). Artery transport 
handles stale ACKs correctly by silently ignoring them.
   - While classic remoting is deprecated, it is still compiled, shipped, and 
used by some deployments.
   
   ### Fix
   
   When `cumulativeAck > maxSeq`, the buffer should treat it as a no-op (return 
the current buffer unchanged) rather than throwing. A warning log should be 
emitted in the endpoint when a stale ACK is detected, to aid in debugging.
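
   A minimal sketch of the proposed behaviour, under the same simplifying 
assumptions: `SendBuffer`, `handleAck`, and the `warn` callback are 
hypothetical names for illustration, not the Pekko API, and the real change 
would live in `AckedSendBuffer.acknowledge` and the endpoint that calls it.

```scala
object StaleAckTolerantSketch {
  // Hypothetical, simplified stand-in for the classic-remoting send buffer.
  final case class SendBuffer(nonAcked: Vector[Long] = Vector.empty, maxSeq: Long = -1L) {
    def buffer(seq: Long): SendBuffer =
      copy(nonAcked = nonAcked :+ seq, maxSeq = seq)

    // Proposed behaviour: an ACK beyond maxSeq is a no-op instead of an error.
    def acknowledge(cumulativeAck: Long): SendBuffer =
      if (cumulativeAck > maxSeq) this
      else copy(nonAcked = nonAcked.filter(_ > cumulativeAck))
  }

  // Stand-in for the endpoint: it detects the stale ACK and emits the warning,
  // so the buffer itself stays free of logging concerns (an assumed design).
  def handleAck(buf: SendBuffer, cumulativeAck: Long, warn: String => Unit): SendBuffer = {
    if (cumulativeAck > buf.maxSeq)
      warn(s"Ignoring stale ACK($cumulativeAck); highest SEQ so far is ${buf.maxSeq}")
    buf.acknowledge(cumulativeAck)
  }
}
```

   With this shape, the delayed `ACK(2)` arriving after `reset()` leaves the 
buffer unchanged and produces one warning, while ACKs within range still 
remove the acknowledged messages.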
   
   ### References
   
   This same bug was identified and fixed in the Akka.NET project: 
https://github.com/akkadotnet/akka.net/issues/8116


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

