Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2410
Hi @beyond1920,
first let me comment on the different points you've raised.
1. I'm actually not sure whether the exponential backoff strategy makes
sense. The `maxHeartbeatTimeout` won't solve the underlying problem, as your
timeout can then be up to `2*maxHeartbeatTimeout`. I guess the best approach
would be to have a periodic heartbeat interval which adapts to the load. But
I'm not sure here.
2. It's correct that the `HeartbeatManager` starts the
`HeartbeatScheduler`. However, from then on, the `HeartbeatScheduler`
instances are responsible for the triggering. In order to control the heartbeat
load you would need to coordinate it on the `HeartbeatManager` level.
3. Sure, but the question is whether the RM, TM and JM have to implement an
interface for it (adding it to their RPC contract) or whether we have an
independent component responsible for the heartbeating. Furthermore, there is
a difference between replying to an RPC call (giving the result back as a
future) and sending an RPC of your own back to the sender. The former has the
problem that timed-out futures will be ignored, and thus all the work you've
done on the receiver side has to be redone for the next heartbeat trigger
message.
4. I'm actually not so sure whether the components are so different.
Logically the receiving and sending side should have the same heartbeat timeout
detection logic. The only difference should be that the sender initiates the
heartbeat and the receiver answers it. One could also think about having a
heartbeat sender instance on each side. Anyway, I think that the receiving and
sending sides should be developed together and not considered separate parts.
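
To make point 4 a bit more concrete, here is a minimal sketch of timeout
detection logic that could be shared by the sending and the receiving side.
All names (`HeartbeatMonitor`, `monitor`, `reportHeartbeat`, `expirePeers`) are
hypothetical and not part of the PR or of Flink's API. Timestamps are passed in
explicitly, which is an assumption made only to keep the sketch deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: timeout detection usable on both heartbeat sides. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Last time (in ms) we heard from each monitored peer, keyed by peer id.
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Starts monitoring a peer; this counts as hearing from it at nowMillis. */
    void monitor(String peerId, long nowMillis) {
        lastSeen.put(peerId, nowMillis);
    }

    /**
     * Called whenever a heartbeat request (receiver side) or a heartbeat
     * response (sender side) arrives. Unmonitored peers are ignored.
     */
    void reportHeartbeat(String peerId, long nowMillis) {
        lastSeen.computeIfPresent(
                peerId, (id, previous) -> Math.max(previous, nowMillis));
    }

    /** Returns the peers that exceeded the timeout and stops monitoring them. */
    List<String> expirePeers(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                expired.add(entry.getKey());
            }
        }
        lastSeen.keySet().removeAll(expired);
        return expired;
    }
}
```

With something like this, the only difference between the two sides would be
who triggers the heartbeat; both would call `reportHeartbeat` on incoming
messages and `expirePeers` periodically.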
I think we should first concentrate on these questions to agree on a design
for the heartbeat component. This should happen in the JIRA issue.
Furthermore, I think you still have to clean up the PR because it contains
unrelated changes. So it would be best to close it so that we can concentrate
on the design first. Once this is done, you can open a new PR with the
implementation.