[ 
https://issues.apache.org/jira/browse/FLINK-22276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321312#comment-17321312
 ] 

Matthias commented on FLINK-22276:
----------------------------------

[~Thesharing] thanks for your analysis. I guess you're right. The 
{{DefaultScheduler.delayExecutor}} is multi-threaded. That's where the race 
condition between restarting the two tasks and archiving the failure of the 
single task happens. We should have considered the {{ExecutionVertex}}'s 
version when checking the tasks for failures. Already restarted tasks should 
not be considered for archiving as their failure should have been archived 
already.
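
To illustrate the version check described above, here is a minimal, self-contained sketch. It is not Flink's scheduler code: {{VersionedFailureFilter}}, {{FailedTask}} and {{selectForArchiving}} are hypothetical names, and the sketch assumes each vertex carries a version counter that is bumped on every restart. A failure is only picked up for archiving if the version recorded at failure time still matches the vertex's current version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VersionedFailureFilter {

    /** Hypothetical stand-in: a failed task plus the vertex version it had when it failed. */
    static final class FailedTask {
        final String vertexId;
        final long versionAtFailure;
        final Throwable cause;

        FailedTask(String vertexId, long versionAtFailure, Throwable cause) {
            this.vertexId = vertexId;
            this.versionAtFailure = versionAtFailure;
            this.cause = cause;
        }
    }

    /**
     * Keeps only failures whose recorded version still matches the current
     * version of the vertex. A mismatch means the vertex was restarted in the
     * meantime, so (by assumption) its failure has been archived already.
     */
    static List<FailedTask> selectForArchiving(
            List<FailedTask> candidates, Map<String, Long> currentVersions) {
        List<FailedTask> result = new ArrayList<>();
        for (FailedTask task : candidates) {
            Long current = currentVersions.get(task.vertexId);
            if (current != null && current.longValue() == task.versionAtFailure) {
                result.add(task);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed scenario: vertex "v1" was already restarted (version 0 -> 1),
        // vertex "v2" has not been restarted since it failed.
        Map<String, Long> currentVersions = new HashMap<>();
        currentVersions.put("v1", 1L);
        currentVersions.put("v2", 0L);

        List<FailedTask> candidates = new ArrayList<>();
        candidates.add(new FailedTask("v1", 0L, new RuntimeException("first failure")));
        candidates.add(new FailedTask("v2", 0L, new RuntimeException("second failure")));

        for (FailedTask task : selectForArchiving(candidates, currentVersions)) {
            System.out.println(task.vertexId); // only v2 passes the version check
        }
    }
}
```

In this sketch a vertex with no known version is skipped as well; whether that is the right behaviour for the actual fix is an open design choice.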

One more thing that's probably unrelated, but maybe somebody can explain it: 
The {{RestartPipelinedRegionFailoverStrategy}} selects only one task 
for restart when the first failure occurs. For the second failure, the 
{{RestartPipelinedRegionFailoverStrategy}} selects two tasks for restart. 
Shouldn't the strategy always select the two tasks for restart considering that 
both belong to the same pipeline region? Or am I missing something here?

I'm gonna continue work on this tomorrow...

> ExceptionHistoryEntryExtractor throws fatal error when task failure
> -------------------------------------------------------------------
>
>                 Key: FLINK-22276
>                 URL: https://issues.apache.org/jira/browse/FLINK-22276
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Jin Xing
>            Assignee: Matthias
>            Priority: Blocker
>             Fix For: 1.13.0
>
>         Attachments: image-2021-04-14-17-50-45-199.png, log
>
>
> When running my batch job on a Flink cluster, I got the fatal error below and 
> the JM exited:
> !image-2021-04-14-17-50-45-199.png!
> Digging into the code, when DefaultScheduler starts archiving the failure cause 
> ([https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java#L259]),
> it seems Execution#failureCause is not safely/correctly attached/updated.
> I attached the JM log. [~mapohl] Would you mind helping to verify this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
