Zhu Zhu created FLINK-13962:
-------------------------------
Summary: Execution#taskRestore leaks if task fails before deploying
Key: FLINK-13962
URL: https://issues.apache.org/jira/browse/FLINK-13962
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.10.0
Reporter: Zhu Zhu
Currently, Execution#taskRestore is reset to null in the task deployment stage.
Resetting it "allows the JobManagerTaskRestore instance to be garbage
collected. Furthermore, it won't be archived along with the Execution in the
ExecutionVertex in case of a restart. This is especially important when setting
state.backend.fs.memory-threshold to larger values because every state below
this threshold will be stored in the meta state files and, thus, also the
JobManagerTaskRestore instances." (From FLINK-9693)
However, if a task fails before it reaches the deployment stage,
Execution#taskRestore remains non-null and is archived along with the
Execution in the prior executions.
This may result in high JM heap usage in certain cases, for example when
state.backend.fs.memory-threshold is set to a large value.
I think we should check Execution#taskRestore and make sure it is null when
moving an execution to the prior executions.
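
A minimal sketch of the proposed check, assuming a heavily simplified model
of the classes involved. RestoreSketch, ExecutionSketch, ExecutionVertexSketch
and all of their methods are hypothetical stand-ins for illustration, not
Flink's actual Execution or ExecutionVertex code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JobManagerTaskRestore (assumption, not the Flink class).
final class RestoreSketch {
    final byte[] inlinedState;

    RestoreSketch(byte[] inlinedState) {
        this.inlinedState = inlinedState;
    }
}

// Simplified model of one execution attempt (assumption, not Flink's Execution).
final class ExecutionSketch {
    private RestoreSketch taskRestore;

    void setInitialState(RestoreSketch restore) {
        this.taskRestore = restore;
    }

    // Behaviour described in this issue: the reference is cleared at deployment.
    void deploy() {
        // ... the restore data would be handed to the deployment here ...
        taskRestore = null;
    }

    // Proposed safety net: clear the reference no matter how far the attempt got.
    void releaseTaskRestore() {
        taskRestore = null;
    }

    boolean holdsTaskRestore() {
        return taskRestore != null;
    }
}

// Simplified model of the vertex that archives prior attempts (assumption).
final class ExecutionVertexSketch {
    private final List<ExecutionSketch> priorExecutions = new ArrayList<>();
    private ExecutionSketch currentExecution = new ExecutionSketch();

    ExecutionSketch getCurrentExecution() {
        return currentExecution;
    }

    List<ExecutionSketch> getPriorExecutions() {
        return priorExecutions;
    }

    // Archive the current attempt, making sure it no longer pins restore data.
    void resetForNewExecution() {
        currentExecution.releaseTaskRestore();
        priorExecutions.add(currentExecution);
        currentExecution = new ExecutionSketch();
    }
}
```

In this sketch, an attempt that fails before deploy() is called still loses
its restore reference when it is archived, so the archived attempt no longer
keeps the restore data reachable on the heap.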