[
https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
Description:
We make the assumption that Global Failures (with null Task name) may only be
RootExceptions and and Local/Task exception may be part of concurrent
exceptions List (see {{{}JobExceptionsHandler#createRootExceptionInfo{}}}) --
However, when the Adaptive scheduler is in a Restarting phase due to an
existing failure (that is now the new Root) we can still, in rare occasions,
capture new Global failures, violating this condition (with an assertion is
thrown as part of {{{}assertLocalExceptionInfo{}}}).
A solution to this could be to ignore Global failures while being in a
Restarting phase on the Adaptive scheduler.
This PR also fixes a smaller bug where we dont pass the
[taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
properly.
Note: DefaultScheduler does not suffer from this issue as it treats failures
directly as HistoryEntries (no conversion step)
was:
{{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that
*Global* Failures (with null Task name) may *only* be RootExceptions (jobs are
considered in FAILED state when this happens and no further exceptions are
captured) and *Local/Task* may be part of concurrent exceptions List *--* if
this precondition is violated, an assertion is thrown as part of
{{{}asserLocalExceptionInfo{}}}.
The issue lies within
[convertFailures|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422]]
logic where we take the failureCollection pointer and convert it to a
HistoryEntry.
In more detail, we are passing the first Failure and a pointer to the remaining
failures collection as part of HistoryEntry creation — and then add the entry
in the exception History.
In our specific scenario a Local Failure first comes in, we call
convertFailures that creates a HistoryEntry and removes the LocalFailure from
the collection while also passing a pointer to the empty failureCollection.
Then a Global failure comes in (and before conversion), it is added to the
failureCollection (that was empty) just before serving the requestJob that
returns the List of History Entries.
This messes things up, as the LocalFailure now has a
ConcurrentExceptionsCollection with a Global Failure that should never happen
(causing the assertion).
A solution is to create a Copy of the failureCollection in the conversion
instead of passing the pointer around (as I did in the updated PR)
This PR also fixes a smaller bug where we dont pass the
[taskName|[https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
properly.
Note: DefaultScheduler does not suffer from this issue as it treats failures
directly as HistoryEntries (no conversion step)
> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
> Key: FLINK-33121
> URL: https://issues.apache.org/jira/browse/FLINK-33121
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Panagiotis Garefalakis
> Assignee: Panagiotis Garefalakis
> Priority: Major
> Labels: pull-request-available
>
> We make the assumption that Global Failures (with null Task name) may only be
> RootExceptions and and Local/Task exception may be part of concurrent
> exceptions List (see {{{}JobExceptionsHandler#createRootExceptionInfo{}}}) --
> However, when the Adaptive scheduler is in a Restarting phase due to an
> existing failure (that is now the new Root) we can still, in rare occasions,
> capture new Global failures, violating this condition (with an assertion is
> thrown as part of {{{}assertLocalExceptionInfo{}}}).
> A solution to this could be to ignore Global failures while being in a
> Restarting phase on the Adaptive scheduler.
> This PR also fixes a smaller bug where we dont pass the
> [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
> properly.
> Note: DefaultScheduler does not suffer from this issue as it treats failures
> directly as HistoryEntries (no conversion step)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)