[
https://issues.apache.org/jira/browse/FLINK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhu Zhu updated FLINK-13962:
----------------------------
Summary: Task state handles leak if the task fails before deploying (was:
Execution#taskRestore leaks if task fails before deploying)
> Task state handles leak if the task fails before deploying
> ----------------------------------------------------------
>
> Key: FLINK-13962
> URL: https://issues.apache.org/jira/browse/FLINK-13962
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Zhu Zhu
> Priority: Major
>
> Currently, Execution#taskRestore is reset to null in the task deployment stage.
> The purpose of it is "allows the JobManagerTaskRestore instance to be garbage
> collected. Furthermore, it won't be archived along with the Execution in the
> ExecutionVertex in case of a restart. This is especially important when
> setting state.backend.fs.memory-threshold to larger values because every
> state below this threshold will be stored in the meta state files and, thus,
> also the JobManagerTaskRestore instances." (From FLINK-9693)
>
> However, if a task fails before it reaches the deployment stage (e.g. it
> fails due to a slot allocation timeout), Execution#taskRestore remains
> non-null and is archived in the prior executions.
> This may result in high JM heap usage in certain cases and lead to continuous
> JM full GCs.
>
> I propose setting the taskRestore field to null before moving an
> execution to prior executions.
> We can keep the logic which sets the taskRestore field to null after task
> deployment to allow the
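
The proposal above could be sketched roughly as follows. This is a minimal illustration only, not Flink's actual code: Attempt and Vertex are hypothetical stand-ins for Execution and ExecutionVertex, a byte[] stands in for JobManagerTaskRestore, and all method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these types or methods are Flink APIs.
class TaskRestoreSketch {

    /** Stands in for Execution. */
    static final class Attempt {
        // Stands in for the JobManagerTaskRestore reference (possibly large).
        private byte[] taskRestore;

        Attempt(byte[] taskRestore) {
            this.taskRestore = taskRestore;
        }

        /** Existing behavior described in the issue: cleared at deployment. */
        void deploy() {
            taskRestore = null;
        }

        /** Proposed addition: explicit release, safe to call more than once. */
        void releaseTaskRestore() {
            taskRestore = null;
        }

        boolean holdsTaskRestore() {
            return taskRestore != null;
        }
    }

    /** Stands in for ExecutionVertex, which archives prior attempts. */
    static final class Vertex {
        private final List<Attempt> priorAttempts = new ArrayList<>();
        private Attempt current;

        Vertex(Attempt first) {
            this.current = first;
        }

        /** Archives the current attempt and installs a new one. */
        void resetForNewAttempt(Attempt next) {
            // Release before archiving, so an attempt that failed before
            // deployment does not keep its state handles reachable.
            current.releaseTaskRestore();
            priorAttempts.add(current);
            current = next;
        }

        List<Attempt> priorAttempts() {
            return priorAttempts;
        }
    }
}
```

In this sketch the release happens in the one place every archived attempt passes through, so it covers attempts that never reached deploy() while leaving the deployment-time reset untouched.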
--
This message was sent by Atlassian Jira
(v8.3.2#803003)