[
https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chesnay Schepler updated FLINK-33121:
-------------------------------------
Affects Version/s: 1.18.0
> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
> Key: FLINK-33121
> URL: https://issues.apache.org/jira/browse/FLINK-33121
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.0
> Reporter: Panagiotis Garefalakis
> Assignee: Panagiotis Garefalakis
> Priority: Major
> Labels: pull-request-available
>
> We assume that global failures (those with a null task name) can only be
> root exceptions, and that only local/task exceptions can be part of the
> concurrent exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
> However, while the Adaptive scheduler is in a Restarting phase because of an
> existing failure (which is now the new root), it can still, on rare occasions,
> capture new global failures. This violates the assumption, and an assertion is
> thrown from {{assertLocalExceptionInfo}} with a message like:
> {code:java}
> The taskName must not be null for a non-global failure.
> {code}
> We want to ignore global failures while the Adaptive scheduler is in a
> Restarting phase, until multiple global failures are properly supported in the
> exception history as part of https://issues.apache.org/jira/browse/FLINK-34922 .
> Note: DefaultScheduler does not suffer from this issue because it treats failures
> directly as HistoryEntries (no conversion step).
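
The proposed behaviour can be illustrated with a minimal, self-contained sketch. It is not Flink code: `GlobalFailureFilterSketch`, `Failure`, `RestartingPhase`, `onFailure` and `assertLocal` are hypothetical names invented for this example, and the only details taken from the issue are that a global failure has a null task name and that the precondition message reads as quoted above. The sketch assumes that dropping a late global failure at the point where it is reported is enough to keep the concurrent list local-only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified model; none of these types are Flink classes. */
final class GlobalFailureFilterSketch {

    /** A failure record. A null taskName marks a global failure, as in the issue text. */
    static final class Failure {
        final String taskName; // null => global failure
        final String message;

        Failure(String taskName, String message) {
            this.taskName = taskName;
            this.message = message;
        }

        boolean isGlobal() {
            return taskName == null;
        }
    }

    /** Stand-in for the precondition from the issue: concurrent entries must be local. */
    static void assertLocal(Failure failure) {
        if (failure.isGlobal()) {
            throw new IllegalArgumentException(
                    "The taskName must not be null for a non-global failure.");
        }
    }

    /** Toy model of a Restarting phase: one root failure plus concurrent failures. */
    static final class RestartingPhase {
        private final Failure root;
        private final List<Failure> concurrent = new ArrayList<>();

        RestartingPhase(Failure root) {
            this.root = root;
        }

        /** Records a failure seen while restarting; returns false if it was ignored. */
        boolean onFailure(Failure failure) {
            if (failure.isGlobal()) {
                // Assumed fix: ignore global failures that arrive while restarting.
                return false;
            }
            concurrent.add(failure);
            return true;
        }

        Failure rootFailure() {
            return root;
        }

        /** Returns the concurrent failures after checking that each one is local. */
        List<Failure> concurrentFailures() {
            for (Failure failure : concurrent) {
                assertLocal(failure);
            }
            return Collections.unmodifiableList(new ArrayList<>(concurrent));
        }
    }

    public static void main(String[] args) {
        RestartingPhase phase = new RestartingPhase(new Failure(null, "root global failure"));
        phase.onFailure(new Failure(null, "late global failure")); // ignored
        phase.onFailure(new Failure("map-1", "local task failure")); // recorded
        System.out.println("concurrent failures: " + phase.concurrentFailures().size());
    }
}
```

Without the `isGlobal()` check in `onFailure`, the late global failure would land in the concurrent list and `concurrentFailures()` would fail with the message quoted in the description.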