Shammon created FLINK-25338:
-------------------------------
Summary: Improvement of connection from TM to JM in session cluster
Key: FLINK-25338
URL: https://issues.apache.org/jira/browse/FLINK-25338
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Affects Versions: 1.14.2, 1.13.5, 1.12.7
Reporter: Shammon
When a TaskManager receives a slot request for a specific job from the
ResourceManager, it connects to the JobMaster at the given job address. The
TaskManager registers itself, monitors the job's heartbeat and updates task
states over this connection. In a session cluster all JobMasters run in the
same JobManager process, so there is no need for one TaskManager to create a
separate connection for each job, and when the TaskManager is busy, these
per-job connections increase the latency of jobs.
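A minimal Java (16+) sketch of the per-job pattern described above, assuming
a simplified TaskManager that keeps one connection per job id.
`PerJobConnections`, `JobMasterConnection`, `onSlotRequest` and the addresses
are hypothetical names for illustration, not Flink's TaskExecutor API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-job TaskManager -> JobMaster connections.
// Names are illustrative only; this is not Flink's TaskExecutor API.
public class PerJobConnections {

    // Stand-in for the RPC connection a TaskManager keeps to one JobMaster.
    record JobMasterConnection(String jobId, String jobMasterAddress) {}

    private final Map<String, JobMasterConnection> connections = new HashMap<>();

    // Invoked when the ResourceManager asks this TaskManager for a slot of `jobId`.
    JobMasterConnection onSlotRequest(String jobId, String jobMasterAddress) {
        // The first slot request for a job opens a new connection; registration,
        // heartbeats and task state updates for that job then use it.
        return connections.computeIfAbsent(
                jobId, id -> new JobMasterConnection(id, jobMasterAddress));
    }

    int openConnections() {
        return connections.size();
    }

    public static void main(String[] args) {
        PerJobConnections taskManager = new PerJobConnections();
        // All JobMasters live in the same session-cluster process (hypothetical
        // addresses), yet each job still gets its own connection.
        for (int i = 0; i < 3; i++) {
            taskManager.onSlotRequest("job-" + i, "session-jm:6123/jobmaster-" + i);
        }
        taskManager.onSlotRequest("job-0", "session-jm:6123/jobmaster-0"); // reuses job-0's connection
        System.out.println(taskManager.openConnections()); // prints 3
    }
}
```

The number of connections grows with the number of jobs that have slots on
the TaskManager, even though every connection ends at the same process.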
One idea is for the TaskManager to manage a single connection to the
`Dispatcher` and send events such as heartbeats and state updates to it; the
`Dispatcher` then forwards them to the local `JobMaster`. The main problem is
that the `Dispatcher` is an actor that executes in a single thread, so it may
become a performance bottleneck for deserializing events.
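The first idea could be sketched as follows, assuming a single-threaded
executor stands in for the `Dispatcher`'s actor thread and a toy
`jobId|payload` string format replaces real serialization.
`DispatcherRouting`, `TaskManagerEvent` and the method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch of idea 1: every TaskManager event is deserialized on the
// Dispatcher's single thread and then forwarded to the local JobMaster.
// Not Flink API; all names are illustrative.
public class DispatcherRouting {

    // Event sent by a TaskManager, e.g. a heartbeat or a task state update.
    record TaskManagerEvent(String jobId, String payload) {}

    // Single-threaded executor standing in for the Dispatcher actor's main thread.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    private final Map<String, Consumer<TaskManagerEvent>> jobMasters = new ConcurrentHashMap<>();

    void registerJobMaster(String jobId, Consumer<TaskManagerEvent> jobMaster) {
        jobMasters.put(jobId, jobMaster);
    }

    // Events from all TaskManagers and all jobs queue up on this one thread.
    Future<?> onMessage(byte[] serialized) {
        return mainThread.submit(() -> {
            TaskManagerEvent event = deserialize(serialized); // potential bottleneck
            Consumer<TaskManagerEvent> jobMaster = jobMasters.get(event.jobId());
            if (jobMaster != null) {
                jobMaster.accept(event);
            }
        });
    }

    void shutdown() {
        mainThread.shutdown();
    }

    // Toy wire format "jobId|payload" in place of real RPC serialization.
    static byte[] serialize(TaskManagerEvent event) {
        return (event.jobId() + "|" + event.payload()).getBytes(StandardCharsets.UTF_8);
    }

    static TaskManagerEvent deserialize(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        int sep = s.indexOf('|');
        return new TaskManagerEvent(s.substring(0, sep), s.substring(sep + 1));
    }

    public static void main(String[] args) throws Exception {
        DispatcherRouting dispatcher = new DispatcherRouting();
        Set<String> threads = ConcurrentHashMap.newKeySet();
        dispatcher.registerJobMaster("job-a", e -> threads.add(Thread.currentThread().getName()));
        dispatcher.registerJobMaster("job-b", e -> threads.add(Thread.currentThread().getName()));
        dispatcher.onMessage(serialize(new TaskManagerEvent("job-a", "heartbeat"))).get();
        dispatcher.onMessage(serialize(new TaskManagerEvent("job-b", "RUNNING"))).get();
        dispatcher.shutdown();
        System.out.println(threads.size()); // prints 1: both jobs' events used the same thread
    }
}
```

Because every event from every TaskManager is deserialized on that one thread,
throughput is limited by what a single thread can deserialize, regardless of
how many jobs are running.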
The other idea is to create a Netty service in `SessionClusterEntrypoint` that
receives and deserializes events from TaskManagers in a thread pool and then
sends each event to the `Dispatcher` or the `JobMaster`. Each TaskManager
manages its connection to the Netty service when it starts. Such a service
could later also receive job results from TaskManagers.
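The second idea could look roughly like this, assuming a fixed thread pool
stands in for Netty's worker threads (no Netty API is used) and a toy
`jobId|type|payload` string format replaces real serialization. Events are
deserialized in parallel and then routed by job id; job results and events for
unknown jobs fall back to the `Dispatcher`. `PooledEventService`, `onFrame`
and the event types are hypothetical (Java 16+ records):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch of idea 2: deserialize TaskManager events on a thread pool,
// then hand decoded events to the JobMaster or Dispatcher. Not Netty/Flink API.
public class PooledEventService {

    record TaskManagerEvent(String jobId, String type, String payload) {}

    private final ExecutorService ioPool;
    private final Map<String, Consumer<TaskManagerEvent>> jobMasters = new ConcurrentHashMap<>();
    private final Consumer<TaskManagerEvent> dispatcher;

    PooledEventService(int threads, Consumer<TaskManagerEvent> dispatcher) {
        this.ioPool = Executors.newFixedThreadPool(threads);
        this.dispatcher = dispatcher;
    }

    void registerJobMaster(String jobId, Consumer<TaskManagerEvent> jobMaster) {
        jobMasters.put(jobId, jobMaster);
    }

    // Called for each frame read from a TaskManager's long-lived connection.
    void onFrame(byte[] frame) {
        ioPool.execute(() -> {
            TaskManagerEvent event = deserialize(frame); // runs in parallel across the pool
            // Job results go to the Dispatcher; other events go to their job's JobMaster,
            // falling back to the Dispatcher for unknown jobs. In Flink the hand-off would
            // have to go through the target's main thread; here the consumer is called directly.
            Consumer<TaskManagerEvent> target = "JOB_RESULT".equals(event.type())
                    ? dispatcher
                    : jobMasters.getOrDefault(event.jobId(), dispatcher);
            target.accept(event);
        });
    }

    // Stops accepting frames and waits up to 10 seconds for queued ones to finish.
    void shutdown() throws InterruptedException {
        ioPool.shutdown();
        ioPool.awaitTermination(10, TimeUnit.SECONDS);
    }

    // Toy wire format "jobId|type|payload" in place of real serialization.
    static byte[] serialize(TaskManagerEvent e) {
        return (e.jobId() + "|" + e.type() + "|" + e.payload()).getBytes(StandardCharsets.UTF_8);
    }

    static TaskManagerEvent deserialize(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split("\\|", 3);
        return new TaskManagerEvent(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger jobA = new AtomicInteger();
        AtomicInteger jobB = new AtomicInteger();
        AtomicInteger dispatcherEvents = new AtomicInteger();
        PooledEventService service = new PooledEventService(4, e -> dispatcherEvents.incrementAndGet());
        service.registerJobMaster("job-a", e -> jobA.incrementAndGet());
        service.registerJobMaster("job-b", e -> jobB.incrementAndGet());
        service.onFrame(serialize(new TaskManagerEvent("job-a", "HEARTBEAT", "")));
        service.onFrame(serialize(new TaskManagerEvent("job-a", "TASK_STATE", "RUNNING")));
        service.onFrame(serialize(new TaskManagerEvent("job-b", "TASK_STATE", "FINISHED")));
        service.onFrame(serialize(new TaskManagerEvent("job-b", "JOB_RESULT", "SUCCEEDED")));
        service.shutdown();
        System.out.println(jobA.get() + " " + jobB.get() + " " + dispatcherEvents.get()); // prints "2 1 1"
    }
}
```

In this shape the single-threaded components only receive already-decoded
events, so the deserialization cost is spread over the pool instead of sitting
on one actor thread.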
[~xtsong] What do you think? THX
--
This message was sent by Atlassian Jira
(v8.20.1#820001)