[
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533316#comment-17533316
]
Atri Sharma commented on FLINK-26932:
-------------------------------------
Is this ticket still open? I am looking to start contributing to Flink and am
wondering whether this would be a good place to start.
> TaskManager hung in cleanupAllocationBaseDirs not exit.
> -------------------------------------------------------
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.0
> Reporter: huweihua
> Priority: Major
> Attachments: 1280X1280.png,
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error. The TaskManager then hung
> in cleanupAllocationBaseDirs, which blocked its main thread.
>
> As a result, this TaskManager did not respond to
> cancelTask/disconnectResourceManager requests.
>
> Meanwhile, the JobMaster had already treated this TaskManager as lost and
> scheduled its tasks on other TaskManagers.
>
> This can leave tasks running unexpectedly on the hung TaskManager.
>
> The TaskManager log shows that the TM had already lost its connection to the
> ResourceManager and kept trying to register with it.
> The RegistrationTimeout cannot take effect because the main thread of the
> TaskManager is hung.
>
> I think there are two options to handle it.
> Option 1: Add a timeout to
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs. However, I am
> afraid other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread. We would then need
> to deal with the concurrency problem.
>
>
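For discussion, Option 1 from the description might look roughly like the sketch below. It is not Flink code: the class CleanupTimeoutSketch and the helper cleanupWithTimeout are hypothetical names, and the sketch assumes the blocking cleanup can be handed to a separate daemon thread so that the caller waits only for a bounded time.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CleanupTimeoutSketch {

    // Hypothetical helper, not a Flink API: runs a potentially blocking
    // cleanup on a separate daemon thread and waits at most timeoutMillis.
    // Returns true only if the cleanup finished normally within the timeout.
    public static boolean cleanupWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup-io");
            t.setDaemon(true); // a stuck cleanup must not keep the JVM alive
            return t;
        });
        Future<?> future = ioExecutor.submit(cleanup);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best-effort interrupt; I/O stuck on a broken disk may ignore it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the cleanup itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } finally {
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A cleanup that returns immediately completes within the timeout.
        System.out.println(cleanupWithTimeout(() -> { }, 1000));

        // A cleanup that blocks (simulated with sleep) is abandoned after 100 ms.
        System.out.println(cleanupWithTimeout(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100));
    }
}
```

The calling thread still waits for up to the timeout, so this only bounds the stall rather than removing it, and it does not address the reporter's concern that other methods may block the main thread as well.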