[jira] [Commented] (FLINK-34416) "Local recovery and sticky scheduling end-to-end test" still doesn't work with AdaptiveScheduler

Matthias Pohl (Jira) Thu, 08 Jan 2026 00:43:12 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050549#comment-18050549
 ]


Matthias Pohl commented on FLINK-34416:
---------------------------------------

Just for more context (after digging through Jira and PRs): FLINK-21450 
introduced a fix for local recovery in the AdaptiveScheduler which doesn't seem 
to cover all cases. The localRecovery-related tests which we wanted to enable 
after applying the fix still fail (see FLINK-34409).

I added context on why some of the tests are not working to the 
[PR|https://github.com/apache/flink/pull/24285/commits/5845b50565f8bd5c14df3eb6e4da09a4ff00c42d]:
{quote}// The AdaptiveScheduler doesn't support partial recovery but restarts 
all Executions in case of
// a local failure.
{quote}
and
{quote}// The AdaptiveScheduler doesn't update the ExecutionGraph but creates a 
new Execution during
// local recovery. Recovering can also lead to a change in parallelism which 
makes the
// executionHistory non-linear. The lack of a linear executionHistory prevents 
us from applying
// the same test for the AdaptiveScheduler.
{quote}
That can be used as a base to continue the investigation around this Jira issue 
here.

> "Local recovery and sticky scheduling end-to-end test" still doesn't work 
> with AdaptiveScheduler
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34416
>                 URL: https://issues.apache.org/jira/browse/FLINK-34416
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0, 1.18.1, 1.20.0
>            Reporter: Matthias Pohl
>            Priority: Major
>              Labels: test-stability
>
> We tried to enable all {{AdaptiveScheduler}}-related tests in FLINK-34409 
> because it appeared that all Jira issues that were referenced are resolved. 
> That's not the case for the {{"Local recovery and sticky scheduling 
> end-to-end test"}} tests, though.
> With the {{AdaptiveScheduler}} being enabled, we run into issues where the 
> test runs forever due to a {{NullPointerException}} continuously triggering a 
> failure:
> {code}
> Feb 07 19:02:59 2024-02-07 19:02:21,706 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map -> 
> Sink: Unnamed (3/4) 
> (54075d3d22edb729e5f396726f777860_20ba6b65f97481d5570070de90e4e791_2_16292) 
> switched from INITIALIZING to FAILED on localhost:40893-09ff7>
> Feb 07 19:02:59 java.lang.NullPointerException: Expected to find info here.
> Feb 07 19:02:59         at 
> org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76) 
> ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.tests.StickyAllocationAndLocalRecoveryTestJob$StateCreatingFlatMap.initializeState(StickyAllocationAndLocalRecoveryTestJob.java:340)
>  ~[?:?]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:187)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:169)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:134)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:799)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:753)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:753)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:712)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) 
> ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751) 
> ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at 
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) 
> ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> Feb 07 19:02:59         at java.lang.Thread.run(Thread.java:750) 
> ~[?:1.8.0_402]
> {code}
> This error is caused by a Precondition in 
> [StickyAllocationAndLocalRecoveryTestJob:340|https://github.com/apache/flink/blob/0f3470db83c1fddba9ac9a7299b1e61baab4ff12/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java#L340]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-34416) "Local recovery and sticky scheduling end-to-end test" still doesn't work with AdaptiveScheduler

Reply via email to