[jira] [Comment Edited] (FLINK-26274) Test local recovery works across TaskManager process restarts

Johnson Okorie (Jira) Wed, 16 Mar 2022 06:42:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507607#comment-17507607
 ]


Johnson Okorie edited comment on FLINK-26274 at 3/16/22, 1:41 PM:
------------------------------------------------------------------

Hi, I tested this feature from the master branch and it doesn't always work for 
me. I tried to follow the same configuration as above, except that I had 3 TMs 
with a parallelism of 3 (so 1 slot per TM). If I scale down by one TM and then 
back up really quickly, it works fine. If the TM stays down for a longer period 
(1+ minute), then when I scale the TMs back up to 3, I can see that the TM 
re-offers the previous slot.

[^taskmanager-2.log] (Grepped slot allocation related logs)

It also seems that the JM accepts the slot, but nothing happens from there. 
After the slot timeout, the slot gets released and the TM offers a new slot, 
triggering recovery from remote storage.

The trace logs show that the slot transitioned from FREE -> PENDING -> 
ALLOCATED -> FREE.

(I am still new to Flink, so I might be doing something very wrong)


was (Author: JIRAUSER286685):
Hi, I tested this feature from the master branch and it doesn't always work for 
me. I tried to follow the same configuration as above, except that I had 3 TMs 
with a parallelism of 3 (so 1 slot per TM). If I scale down by one TM and then 
back up really quickly, it works fine. If the TM stays down for a longer period 
(1+ minute), then when I scale the TMs back up to 3, I can see that the TM 
re-offers the previous slot. It also seems that the JM accepts the slot, but 
nothing happens from there. After the slot timeout, the slot gets released and 
the TM offers a new slot, triggering recovery from remote storage.

[^taskmanager-2.log] (Grepped slot allocation related logs)

On the JobManager side, I see that the slot allocation was registered and 
matched, but it still timed out after 50 seconds.

(I am still new to Flink, so I might be doing something very wrong)

> Test local recovery works across TaskManager process restarts
> -------------------------------------------------------------
>
>                 Key: FLINK-26274
>                 URL: https://issues.apache.org/jira/browse/FLINK-26274
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>         Attachments: jobmanager_local_restore_2.log, taskmanager-2.log, 
> taskmanager_flink-taskmanager-2_log
>
>
> This ticket is a testing task for 
> [FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw].
> When enabling local recovery and configuring a working directory that can be 
> re-read after a process failure, Flink should now be able to recover locally. 
> We should test whether this is the case. Please take a look at the 
> documentation [1, 2] to see how to configure Flink to make use of it.
> [1] 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/working_directory/
> [2] 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts
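
A minimal sketch of the kind of configuration the description above points to,
assuming the option names given in the linked Flink 1.15 documentation [1, 2].
The directory /pv and the resource id tm-0 are placeholder values, not taken
from this ticket; on Kubernetes the resource id would typically be derived from
the pod name of a StatefulSet.

```yaml
# flink-conf.yaml (sketch only; values are placeholders, not from this ticket)
state.backend.local-recovery: true   # keep a local copy of state for recovery
process.working-dir: /pv             # must survive a TM process/pod restart
taskmanager.resource-id: tm-0        # must stay the same across restarts
```

With a stable resource id and a working directory that survives the restart, a
restarted TM should be able to find its previous local state instead of
restoring from remote storage.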



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
