[
https://issues.apache.org/jira/browse/FLINK-22276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321312#comment-17321312
]
Matthias commented on FLINK-22276:
----------------------------------
[~Thesharing] thanks for your analysis. I guess you're right: the
{{DefaultScheduler.delayExecutor}} is multi-threaded, and that's where the race
condition between restarting the two tasks and archiving the failure of the
single task happens. We should have considered the version of the
{{ExecutionVertex}} when checking the tasks for failures. Tasks that have
already been restarted should not be considered for archiving, as their
failure should have been archived already.
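
To make the idea concrete, here is a minimal, self-contained sketch of such a version check. It is not Flink code: {{Vertex}}, {{archive}} and the version map are hypothetical stand-ins, and the sketch only assumes that a vertex's version is incremented on every restart. Vertices whose version has changed since the failure was handled are skipped during archiving.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Flink implementation.
class VersionedArchivingSketch {

    /** Stand-in for an execution vertex; the version grows with every restart. */
    static final class Vertex {
        final String id;
        int version = 0;
        String failureCause = null; // null while the current attempt has not failed

        Vertex(String id) {
            this.id = id;
        }

        void fail(String cause) {
            failureCause = cause;
        }

        void restart() {
            version++;           // a new attempt starts
            failureCause = null; // the new attempt has no failure yet
        }
    }

    /**
     * Archives failures only for vertices that still have the version that was
     * captured when the failure was handled. A changed version means the vertex
     * was restarted in the meantime, so it is skipped.
     */
    static List<String> archive(Map<Vertex, Integer> versionsAtFailure) {
        List<String> archived = new ArrayList<>();
        for (Map.Entry<Vertex, Integer> entry : versionsAtFailure.entrySet()) {
            Vertex vertex = entry.getKey();
            if (vertex.version != entry.getValue()) {
                continue; // already restarted, nothing to archive for this attempt
            }
            if (vertex.failureCause != null) {
                archived.add(vertex.id + ": " + vertex.failureCause);
            }
        }
        return archived;
    }

    public static void main(String[] args) {
        Vertex a = new Vertex("a");
        Vertex b = new Vertex("b");
        a.fail("boom");

        // Capture the versions at the time the failure is handled.
        Map<Vertex, Integer> versionsAtFailure = new LinkedHashMap<>();
        versionsAtFailure.put(a, a.version);
        versionsAtFailure.put(b, b.version);

        // Simulate the race: b is restarted before archiving runs.
        b.restart();

        System.out.println(archive(versionsAtFailure)); // prints [a: boom]
    }
}
```

In this sketch the restarted vertex {{b}} is ignored because its version no longer matches the captured one, so archiving never looks at an attempt that has no failure attached.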

One thing that's probably unrelated, but maybe somebody can give me a reason
for it: the {{RestartPipelinedRegionFailoverStrategy}} selects only one task
for restart when the first failure occurs, but two tasks when the second
failure occurs. Shouldn't the strategy always select both tasks for restart,
considering that both belong to the same pipelined region? Or am I missing
something here?

I'm going to continue working on this tomorrow.
> ExceptionHistoryEntryExtractor throws fatal error when task failure
> -------------------------------------------------------------------
>
> Key: FLINK-22276
> URL: https://issues.apache.org/jira/browse/FLINK-22276
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Jin Xing
> Assignee: Matthias
> Priority: Blocker
> Fix For: 1.13.0
>
> Attachments: image-2021-04-14-17-50-45-199.png, log
>
>
> When running my batch job on a Flink cluster, I got the fatal error below and
> the JM exits:
> !image-2021-04-14-17-50-45-199.png!
> Digging into the code, when DefaultScheduler starts archiving the failure cause
> ([https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java#L259]),
> it seems Execution#failureCause is not safely/correctly attached/updated.
> I attached the JM log. [~mapohl], would you mind helping to verify this?