[ https://issues.apache.org/jira/browse/FLINK-29985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632408#comment-17632408 ]

Roman Khachatryan commented on FLINK-29985:
-------------------------------------------

I found that the TM actually does try to release the slot table, but the 
shutdown is bounded by two hard-coded timeouts:
1. 5s JvmShutdownSafeguard
2. 10s in flink-daemon.sh
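
To illustrate how such timeouts bound the cleanup, here is a minimal sketch. It is not Flink code: the class and method names are hypothetical, and it assumes both timers start at roughly the same moment (on SIGTERM), so the shorter one decides how long the slot table release may take.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Flink.
final class ShutdownBudgetSketch {

    /** Assumption: both timers start together, so the shorter one wins. */
    static long effectiveBudgetMillis(long jvmSafeguardMillis, long daemonScriptMillis) {
        return Math.min(jvmSafeguardMillis, daemonScriptMillis);
    }

    /**
     * Runs the cleanup on a daemon thread and waits at most timeoutMillis.
     * Returns true only if the cleanup finished normally within the deadline.
     */
    static boolean runWithDeadline(Runnable cleanup, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup");
            t.setDaemon(true); // do not keep the JVM alive for a stuck cleanup
            return t;
        });
        Future<?> future = executor.submit(cleanup);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // deadline hit: interrupt the cleanup thread
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // the cleanup itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the values above, effectiveBudgetMillis(5000, 10000) yields 5000, so under this assumption any resource release that needs longer than about 5s would be cut off.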

[cluster.services.shutdown-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#cluster-services-shutdown-timeout] is not taken into account by the TM.

Another issue is that even if the above timeouts don't fire, the logging system 
is stopped prematurely.
That's why the SlotTable release can be silent.
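
A toy sketch of why the ordering matters. It does not use Flink's or any real logging framework's API: SimpleLog is a hypothetical stand-in for a logging backend that drops messages once it has been stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Flink code.
final class ShutdownOrderSketch {

    /** Stand-in for a logging backend that ignores messages after stop(). */
    static final class SimpleLog {
        private final List<String> lines = new ArrayList<>();
        private boolean stopped;

        void info(String message) {
            if (!stopped) {
                lines.add(message);
            }
        }

        void stop() {
            stopped = true;
        }

        List<String> lines() {
            return lines;
        }
    }

    /** Returns the log lines that survive a shutdown in the given order. */
    static List<String> shutdown(boolean stopLoggingFirst) {
        SimpleLog log = new SimpleLog();
        Runnable releaseSlots = () -> log.info("slot table released");
        if (stopLoggingFirst) {
            log.stop();         // logging goes away first ...
            releaseSlots.run(); // ... so the release leaves no trace
        } else {
            releaseSlots.run(); // release is logged ...
            log.stop();         // ... before logging is stopped
        }
        return log.lines();
    }
}
```

In this sketch the release runs in both orders, but it is only visible in the log when logging is stopped last.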


> TaskManager doesn't close SlotTable on SIGTERM
> ----------------------------------------------
>
>                 Key: FLINK-29985
>                 URL: https://issues.apache.org/jira/browse/FLINK-29985
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.16.0, 1.15.3
>            Reporter: Roman Khachatryan
>            Priority: Major
>
> When TM is stopped by RM, its slot table is closed, causing all its slots to 
> be released.
> However, when TM is stopped by SIGTERM (e.g. by an external resource 
> manager), its slot table is NOT closed.
>  
> When a slot is released, the associated resources are released as well, in 
> particular, MemoryManager.
> MemoryManager might hold not only memory, but also arbitrary shared resources 
> (currently, PythonSharedResources and RocksDBSharedResources).
> As of now, RocksDBSharedResources contains only ephemeral resources. Not sure 
> about PythonSharedResources, but likely it is associated with a separate 
> process.
> That means that in standalone clusters, some resources might not be released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
