[ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-----------------------------
    Attachment: 1280X1280.png

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> -------------------------------------------------------
>
>                 Key: FLINK-26932
>                 URL: https://issues.apache.org/jira/browse/FLINK-26932
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>         Attachments: 1280X1280.png, 
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error. The TaskManager then 
> hung in cleanupAllocationBaseDirs, which blocked its main thread.
>  
> As a result, this TaskManager did not respond to cancelTask or 
> disconnectResourceManager requests.
>  
> At the same time, the JobMaster had already marked this TaskManager as lost 
> and scheduled its tasks onto other TaskManagers.
>  
> This may leave some tasks running unexpectedly.
>  
> The TaskManager log shows that the TM had already lost its connection to the 
> ResourceManager and kept trying to register with it. The RegistrationTimeout 
> cannot take effect because the TaskManager's main thread is hung.
>  
> I see two options for handling this.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs. However, I am 
> afraid other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread. We would then need 
> to deal with the resulting concurrency problems.
>  
>  
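The starvation described in the report can be reproduced in isolation. The following is a minimal sketch, not Flink code: it assumes the main thread behaves like a single-threaded scheduled executor, and the class name MainThreadStarvation and the latch standing in for the hung cleanup are hypothetical. It shows that a timeout scheduled on the same thread cannot fire while an earlier task is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a single-threaded "main thread"; not Flink's RPC endpoint.
class MainThreadStarvation {

    /** Returns {timeout fired while the thread was blocked, timeout fired after release}. */
    static boolean[] demo() {
        ScheduledExecutorService mainThread = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch timeoutFired = new CountDownLatch(1);
        try {
            // Stand-in for the cleanup that hangs on a broken disk.
            mainThread.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Stand-in for the registration timeout, scheduled on the same thread.
            mainThread.schedule(timeoutFired::countDown, 50, TimeUnit.MILLISECONDS);

            // The timeout is due after 50 ms, but the only thread is still blocked.
            boolean firedWhileBlocked = timeoutFired.await(300, TimeUnit.MILLISECONDS);

            // Once the blocking task returns, the overdue timeout finally runs.
            release.countDown();
            boolean firedAfterRelease = timeoutFired.await(2, TimeUnit.SECONDS);

            return new boolean[] {firedWhileBlocked, firedAfterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[] {false, false};
        } finally {
            mainThread.shutdownNow();
        }
    }
}
```

In this sketch the timeout only runs after the blocking task is released, which matches the observation that the RegistrationTimeout never takes effect while the cleanup is hung.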

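One possible shape for Option 1 is to run the blocking cleanup on a separate thread and wait for it with a bound. This is a sketch under assumptions, not the actual fix: BoundedCleanup and runWithTimeout are hypothetical names, and the Runnable stands in for the real cleanup call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of Flink.
class BoundedCleanup {

    /**
     * Runs the cleanup on a daemon thread and waits at most timeoutMillis.
     * Returns true only if the cleanup completed normally within the bound.
     */
    static boolean runWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService io = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup-io");
            t.setDaemon(true); // a stuck cleanup must not keep the JVM alive
            return t;
        });
        Future<?> result = io.submit(cleanup);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort: I/O blocked on a failed disk may ignore the interrupt.
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the cleanup itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            io.shutdownNow();
        }
    }
}
```

A caller would pass the cleanup as a Runnable together with a timeout in milliseconds, and decide what to do when false is returned. The caller still waits for up to the timeout, and this only bounds one call, so the concern that other methods may also block the main thread still applies.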


--
This message was sent by Atlassian Jira
(v8.20.1#820001)
