StephanEwen commented on pull request #15605:
URL: https://github.com/apache/flink/pull/15605#issuecomment-820529852


   The way I understand it, there is no such thing as a reliable RPC. For all connection types, we can have one of two models:
   
   1. As soon as the connection fails and we cannot recover it, fail the connection and cause both endpoints to handle this as a recovery situation. That is what happens in the Data Network Stack. That easily handles exactly-once.
   
     - For Tasks that exchange data, that means failover (to a checkpoint) so that both tasks are consistent with each other again.
     - Applying that between JM and TM means that as soon as the connection is lost, we need to terminate the RPC connection and cause one side (probably the TM) to fully reset all state that depends on RPC data (which effectively means restarting the process). In earlier versions we used the Akka deathwatch detector, which behaves somewhat similarly, but the impact was pretty severe: there were many situations where the TM needed to be restarted to ensure RPC consistency, and that had a negative impact on overall system stability.
   
   2. Once we have a connection loss, try to limit the impact. That's what we use between JM and TM for RPC: if the channel can be recovered, we simply go on. This causes far fewer recoveries in general, but needs more delicate failure handling. Effectively, we need to think in "sub channels": as soon as we are unclear about the RPC exchange between two subcomponents (like the coordinator and a specific task), we reset only those two, not the entire component that depends on the RPC connection (the entire TM). That's what we do here (see the sketch below).
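   
   To make the "sub channel" idea a bit more concrete, here is a minimal sketch. All names are hypothetical and not the actual classes touched by this PR: the coordinator talks to each specific subtask through its own handle, and an unacknowledged send resets only that coordinator/subtask pair instead of failing everything behind the shared TM connection.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   
   /**
    * Minimal sketch of the "sub channel" idea (hypothetical names): RPC between the
    * coordinator and one specific subtask is treated as its own channel. If the
    * outcome of a single send is unknown, only that coordinator/subtask pair is
    * reset, not the whole TM and everything else behind the shared RPC connection.
    */
   final class SubtaskRpcChannel {
   
       /** Hypothetical handle for sending events to exactly one subtask. */
       interface SubtaskGateway {
           /** Completes only once delivery of the event is acknowledged. */
           CompletableFuture<Void> sendEvent(byte[] serializedEvent);
       }
   
       /** Hypothetical hook that restarts one subtask and the coordinator's view of it. */
       interface SubtaskResetter {
           void resetSubtaskAndCoordinator(int subtaskIndex, Throwable cause);
       }
   
       private final int subtaskIndex;
       private final SubtaskGateway gateway;
       private final SubtaskResetter resetter;
   
       SubtaskRpcChannel(int subtaskIndex, SubtaskGateway gateway, SubtaskResetter resetter) {
           this.subtaskIndex = subtaskIndex;
           this.gateway = gateway;
           this.resetter = resetter;
       }
   
       void send(byte[] serializedEvent) {
           gateway.sendEvent(serializedEvent).whenComplete((ack, failure) -> {
               if (failure != null) {
                   // We cannot tell whether the subtask saw the event, so this one
                   // exchange is inconsistent. Reset only this sub channel; other
                   // tasks on the same TM keep running.
                   resetter.resetSubtaskAndCoordinator(subtaskIndex, failure);
               }
           });
       }
   }
   ```
   
   The important property is that an uncertain send never escalates beyond the affected pair; the TM-wide handling from model 1 is only needed if the whole connection is declared dead.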
   
   Maybe there is an abstraction for these "sub connections" and their failure handling, but applying it would in any case be more work.
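   
   If such an abstraction existed, it might look roughly like this (purely a sketch, not something this PR introduces): every logical sender/receiver pair multiplexed over the shared JM<->TM connection gets its own delivery contract and its own, locally scoped reset hook.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   
   /**
    * Purely hypothetical sketch of a generic "sub connection" abstraction: each
    * logical sender/receiver pair that shares the physical JM<->TM connection gets
    * its own delivery contract and its own, locally scoped failure handling.
    */
   interface SubConnection<T> {
   
       /** Sends one message; completes exceptionally if delivery cannot be confirmed. */
       CompletableFuture<Void> send(T message);
   
       /**
        * Called when the exchange on this sub connection has become inconsistent.
        * Implementations reset only the two endpoints of this pair, not the
        * component that owns the underlying physical connection.
        */
       void onInconsistentExchange(Throwable cause);
   }
   ```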

