[GitHub] [flink] kezhuw commented on pull request #15605: [FLINK-21996][coordination] - Part 3&4: Ensure OperatorEvent transport losses are handled

GitBox Wed, 14 Apr 2021 07:02:51 -0700


kezhuw commented on pull request #15605:
URL: https://github.com/apache/flink/pull/15605#issuecomment-819542959



   To be honest, my first thought on seeing FLINK-21996 was: are we building 
on an unreliable RPC channel? Then I realized there is 
`AkkaOptions.ASK_TIMEOUT`. I wonder whether we could solve this by 
specifying a timeout for `TaskExecutorGateway.sendOperatorEventToTask` that is 
much larger than `HeartbeatManagerOptions.HEARTBEAT_INTERVAL`. That way, if the 
"received" future fails, the task manager has already been considered down by 
the heartbeat manager. Is there anything wrong with this, or are we just being 
paranoid about unknown errors? I may be missing something because of my limited 
knowledge of Akka. I assumed Akka messaging is reliable (e.g. messages are 
ordered, and a delivery failure will eventually surface as a heartbeat 
timeout). @StephanEwen @tillrohrmann 
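
   To make the idea concrete, here is a rough sketch using only plain JDK 
futures, not the actual Flink code. `OperatorEventTimeoutSketch`, 
`rpcTimeoutFor`, `withTimeout` and the factor of 5 are hypothetical names and 
values made up for illustration; the never-completed future stands in for the 
"received" future of a lost event. It assumes Java 9+ for 
`CompletableFuture.orTimeout`.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OperatorEventTimeoutSketch {

    // Hypothetical helper: derive the RPC timeout as a multiple of the
    // heartbeat interval, so that it is always larger than the interval.
    static Duration rpcTimeoutFor(Duration heartbeatInterval, int factor) {
        if (factor <= 1) {
            throw new IllegalArgumentException("factor must be greater than 1");
        }
        return heartbeatInterval.multipliedBy(factor);
    }

    // Fails the given ack future with a TimeoutException if nothing has
    // completed it within the timeout (CompletableFuture.orTimeout, Java 9+).
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> ack, Duration timeout) {
        return ack.orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Assumed value, shrunk so the demo finishes quickly.
        Duration heartbeatInterval = Duration.ofMillis(20);
        Duration rpcTimeout = rpcTimeoutFor(heartbeatInterval, 5);

        // A future that nobody completes models an event whose ack is lost.
        CompletableFuture<String> lostAck =
                withTimeout(new CompletableFuture<String>(), rpcTimeout);
        try {
            lostAck.join();
        } catch (CompletionException e) {
            boolean timedOut = e.getCause() instanceof TimeoutException;
            System.out.println("ack timed out: " + timedOut);
        }
    }
}
```

   Whether a timed-out ack really implies that the heartbeat manager has 
already declared the task manager dead is exactly the assumption I am asking 
about; the sketch only shows the timeout relation, not that guarantee.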


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
