[ 
https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn Visser reassigned FLINK-33121:
--------------------------------------

    Assignee: Panagiotis Garefalakis

> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>
> {{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that 
> *Global* Failures (with null Task name) may *only* be RootExceptions (jobs 
> are considered in FAILED state when this happens and no further exceptions 
> are captured) and *Local/Task* may be part of concurrent exceptions List *--* 
> if this precondition is violated, an assertion is thrown as part of 
> {{{}asserLocalExceptionInfo{}}}.
> The issue lies within 
> [convertFailures|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422]]
>  logic where we take the failureCollection pointer and convert it to a 
> HistoryEntry.
> In more detail, we are passing the first Failure and a pointer to the 
> remaining failures collection as part of HistoryEntry creation — and then add 
> the entry in the exception History.
> In our specific scenario a Local Failure first comes in, we call 
> convertFailures that creates a HistoryEntry and removes the LocalFailure from 
> the collection while also passing a pointer to the empty failureCollection. 
> Then a Global failure comes in (and before conversion), it is added to the 
> failureCollection (that was empty) just before serving the requestJob that 
> returns the List of History Entries.
> This messes things up, as the LocalFailure now has a 
> ConcurrentExceptionsCollection with a Global Failure that should never happen 
> (causing the assertion).
> A solution is to create a Copy of the failureCollection in the conversion 
> instead of passing the pointer around (as I did in the updated PR)
> This PR also fixes a smaller bug where we dont pass the 
> [taskName|[https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
>  properly.
> Note: DefaultScheduler does not suffer from this issue as it treats failures 
> directly as HistoryEntries (no conversion step)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to