[ 
https://issues.apache.org/jira/browse/FLINK-21103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282453#comment-17282453
 ] 

Arvid Heise commented on FLINK-21103:
-------------------------------------

Another case 
[https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/13184/logs/170]

Here I may have gotten some root cause


{noformat}
2021-02-10T10:44:49.4158320Z Feb 10 10:44:49 Waiting for job to process up to 
1000 records, current progress: 943 records ...
2021-02-10T10:44:52.2562683Z 
2021-02-10T10:44:52.2566380Z 
------------------------------------------------------------
2021-02-10T10:44:52.2568765Z  The program finished with the following exception:
2021-02-10T10:44:52.2570340Z 
2021-02-10T10:44:52.2575752Z org.apache.flink.util.FlinkException: Triggering a 
savepoint for the job ecf66fd871368d8f7fc8fc9ee25b4972 failed.
2021-02-10T10:44:52.2578429Z    at 
org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
2021-02-10T10:44:52.2580763Z    at 
org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
2021-02-10T10:44:52.2583490Z    at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
2021-02-10T10:44:52.2586043Z    at 
org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
2021-02-10T10:44:52.2588747Z    at 
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
2021-02-10T10:44:52.2591056Z    at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
2021-02-10T10:44:52.2593583Z    at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
2021-02-10T10:44:52.2612006Z    at 
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
2021-02-10T10:44:52.2613264Z Caused by: 
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
2021-02-10T10:44:52.2614138Z    at 
org.apache.flink.runtime.scheduler.SchedulerBase.lambda$triggerSavepoint$5(SchedulerBase.java:946)
2021-02-10T10:44:52.2614819Z    at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
2021-02-10T10:44:52.2615430Z    at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
2021-02-10T10:44:52.2616068Z    at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
2021-02-10T10:44:52.2616750Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:442)
2021-02-10T10:44:52.2617375Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:209)
2021-02-10T10:44:52.2618037Z    at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
2021-02-10T10:44:52.2618699Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159)
2021-02-10T10:44:52.2619304Z    at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2021-02-10T10:44:52.2620094Z    at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2021-02-10T10:44:52.2620656Z    at 
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2021-02-10T10:44:52.2621230Z    at 
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2021-02-10T10:44:52.2625679Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2021-02-10T10:44:52.2626154Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2021-02-10T10:44:52.2626612Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2021-02-10T10:44:52.2627051Z    at 
akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2021-02-10T10:44:52.2627486Z    at 
akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2021-02-10T10:44:52.2627912Z    at 
akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2021-02-10T10:44:52.2628325Z    at 
akka.actor.ActorCell.invoke(ActorCell.scala:561)
2021-02-10T10:44:52.2628731Z    at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2021-02-10T10:44:52.2629122Z    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2021-02-10T10:44:52.2629506Z    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
2021-02-10T10:44:52.2629925Z    at 
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2021-02-10T10:44:52.2630402Z    at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2021-02-10T10:44:52.2630880Z    at 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2021-02-10T10:44:52.2631368Z    at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-02-10T10:44:52.2631980Z Caused by: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
2021-02-10T10:44:52.2632594Z    at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
2021-02-10T10:44:52.2633120Z    at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
2021-02-10T10:44:52.2633616Z    at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
2021-02-10T10:44:52.2634129Z    at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
2021-02-10T10:44:52.2634645Z    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
2021-02-10T10:44:52.2635173Z    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
2021-02-10T10:44:52.2635733Z    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$0(CheckpointCoordinator.java:485)
2021-02-10T10:44:52.2636292Z    at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
2021-02-10T10:44:52.2636827Z    at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
2021-02-10T10:44:52.2637352Z    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
2021-02-10T10:44:52.2637876Z    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
2021-02-10T10:44:52.2638408Z    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:515)
2021-02-10T10:44:52.2638988Z    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1911)
2021-02-10T10:44:52.2639617Z    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:985)
2021-02-10T10:44:52.2640218Z    at 
org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$8(SchedulerBase.java:1004)
2021-02-10T10:44:52.2640821Z    at 
org.apache.flink.runtime.scheduler.SchedulerBase.lambda$processCheckpointCoordinatorMessage$9(SchedulerBase.java:1020)
2021-02-10T10:44:52.2641385Z    at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2021-02-10T10:44:52.2641842Z    at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-02-10T10:44:52.2642478Z    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
2021-02-10T10:44:52.2645639Z    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
2021-02-10T10:44:52.2646190Z    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-02-10T10:44:52.2646781Z    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-02-10T10:44:52.2647210Z    at java.lang.Thread.run(Thread.java:748)
2021-02-10T10:44:52.2647663Z Caused by: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
2021-02-10T10:44:52.2648194Z    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:514)
2021-02-10T10:44:52.2648583Z    ... 11 more
2021-02-10T10:44:52.2648977Z Caused by: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
2021-02-10T10:44:52.2649869Z    at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
2021-02-10T10:44:52.2650492Z    at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:175)
2021-02-10T10:44:52.2650905Z    ... 3 more
2021-02-10T10:44:52.2652944Z Caused by: java.lang.Exception: Could not 
materialize checkpoint 19 for operator TtlVerifyUpdateFunction -> Sink: 
PrintFailedVerifications (3/3)#0.
2021-02-10T10:44:52.2653672Z    at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:257)
2021-02-10T10:44:52.2654125Z    ... 4 more
2021-02-10T10:44:52.2654476Z Caused by: 
java.util.concurrent.ExecutionException: java.util.NoSuchElementException
2021-02-10T10:44:52.2654943Z    at 
java.util.concurrent.FutureTask.report(FutureTask.java:122)
2021-02-10T10:44:52.2655374Z    at 
java.util.concurrent.FutureTask.get(FutureTask.java:192)
2021-02-10T10:44:52.2655856Z    at 
org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:621)
2021-02-10T10:44:52.2656513Z    at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54)
2021-02-10T10:44:52.2663244Z    at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:128)
2021-02-10T10:44:52.2663669Z    ... 3 more
2021-02-10T10:44:52.2663957Z Caused by: java.util.NoSuchElementException
2021-02-10T10:44:52.2664330Z    at 
java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
2021-02-10T10:44:52.2664780Z    at 
java.util.HashMap$EntryIterator.next(HashMap.java:1479)
2021-02-10T10:44:52.2665203Z    at 
java.util.HashMap$EntryIterator.next(HashMap.java:1477)
2021-02-10T10:44:52.2665752Z    at 
org.apache.flink.runtime.state.heap.HeapKeyValueStateIterator$MapStateIterator.writeOutNext(HeapKeyValueStateIterator.java:343)
2021-02-10T10:44:52.2666427Z    at 
org.apache.flink.runtime.state.heap.HeapKeyValueStateIterator$StateTableIterator.writeOutMap(HeapKeyValueStateIterator.java:310)
2021-02-10T10:44:52.2667110Z    at 
org.apache.flink.runtime.state.heap.HeapKeyValueStateIterator$StateTableIterator.writeOutNext(HeapKeyValueStateIterator.java:271)
2021-02-10T10:44:52.2667881Z    at 
org.apache.flink.runtime.state.heap.HeapKeyValueStateIterator.next(HeapKeyValueStateIterator.java:173)
2021-02-10T10:44:52.2668480Z    at 
org.apache.flink.runtime.state.FullSnapshotAsyncWriter.writeKVStateData(FullSnapshotAsyncWriter.java:211)
2021-02-10T10:44:52.2669103Z    at 
org.apache.flink.runtime.state.FullSnapshotAsyncWriter.writeSnapshotToOutputStream(FullSnapshotAsyncWriter.java:107)
2021-02-10T10:44:52.2669658Z    at 
org.apache.flink.runtime.state.FullSnapshotAsyncWriter.get(FullSnapshotAsyncWriter.java:77)
2021-02-10T10:44:52.2670202Z    at 
org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:96)
2021-02-10T10:44:52.2670781Z    at 
org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:93)
2021-02-10T10:44:52.2671342Z    at 
org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:78)
2021-02-10T10:44:52.2671918Z    at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-02-10T10:44:52.2672386Z    at 
org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:618)
2021-02-10T10:44:52.2672778Z    ... 5 more
2021-02-10T10:44:52.9792956Z Feb 10 10:44:52 Cancelling job 
ecf66fd871368d8f7fc8fc9ee25b4972.
2021-02-10T10:44:54.3006086Z Feb 10 10:44:54 Cancelled job 
ecf66fd871368d8f7fc8fc9ee25b4972.
2021-02-10T10:44:54.3309952Z Feb 10 10:44:54 
/home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT/bin/flink
 run -d -p 3 
/home/vsts/work/1/s/flink-end-to-end-tests/flink-stream-state-ttl-test/target/DataStreamStateTTLTestProgram.jar
 --test.semantics exactly-once --environment.parallelism 3 --state_backend file 
--state_ttl_verifier.ttl_milli 1000 --state_backend.checkpoint_directory 
file:///home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-19516496225/savepoint-e2e-test-chckpt-dir
 --state_backend.file.async true --update_generator_source.sleep_time 10 
--update_generator_source.sleep_after_elements 1
2021-02-10T10:45:00.2710599Z Feb 10 10:45:00 Job 
(9f86a119830ae2500b1c7c45a8705ba1) is running.
2021-02-10T10:45:00.2909160Z Feb 10 10:45:00 Waiting for job to process up to 
1000 records, current progress: 0 records ...
...
 
2021-02-10T12:19:49.2341516Z Feb 10 12:19:49 Waiting for job to process up to 
1000 records, current progress: 0 records ...
2021-02-10T12:19:50.3102689Z 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_stream_state_ttl.sh:
 line 78: local: can only be used in a function
2021-02-10T12:19:50.3104782Z Feb 10 12:19:50 Checking for non-empty .out 
files...
2021-02-10T12:19:50.3264423Z Feb 10 12:19:50 No non-empty .out files.
2021-02-10T12:19:50.3363980Z Feb 10 12:19:50 Checking of logs skipped.
2021-02-10T12:19:50.3364793Z Feb 10 12:19:50 
2021-02-10T12:19:50.3366101Z Feb 10 12:19:50 [PASS] 'State TTL Heap backend 
end-to-end test' passed after 95 minutes and 31 seconds! Test exited with exit 
code 0.
2021-02-10T12:19:50.3366756Z Feb 10 12:19:50 {noformat}
So I think we have 3 issues:
1. The test itself fails, possibly with a production-code bug.

2. Job restarts but is not tracked properly (test bug).

3. Some cleanup in the test fails leading to a passed test altough it failed.

> E2e tests time out on azure
> ---------------------------
>
>                 Key: FLINK-21103
>                 URL: https://issues.apache.org/jira/browse/FLINK-21103
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Tests
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12377&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529
> {code}
> Creating worker2 ... done
> Jan 22 13:16:17 Waiting for hadoop cluster to come up. We have been trying 
> for 0 seconds, retrying ...
> Jan 22 13:16:22 Waiting for hadoop cluster to come up. We have been trying 
> for 5 seconds, retrying ...
> Jan 22 13:16:27 Waiting for hadoop cluster to come up. We have been trying 
> for 10 seconds, retrying ...
> Jan 22 13:16:32 Waiting for hadoop cluster to come up. We have been trying 
> for 15 seconds, retrying ...
> Jan 22 13:16:37 Waiting for hadoop cluster to come up. We have been trying 
> for 20 seconds, retrying ...
> Jan 22 13:16:43 Waiting for hadoop cluster to come up. We have been trying 
> for 26 seconds, retrying ...
> Jan 22 13:16:48 Waiting for hadoop cluster to come up. We have been trying 
> for 31 seconds, retrying ...
> Jan 22 13:16:53 Waiting for hadoop cluster to come up. We have been trying 
> for 36 seconds, retrying ...
> Jan 22 13:16:58 Waiting for hadoop cluster to come up. We have been trying 
> for 41 seconds, retrying ...
> Jan 22 13:17:03 Waiting for hadoop cluster to come up. We have been trying 
> for 46 seconds, retrying ...
> Jan 22 13:17:08 We only have 0 NodeManagers up. We have been trying for 0 
> seconds, retrying ...
> 21/01/22 13:17:10 INFO client.RMProxy: Connecting to ResourceManager at 
> master.docker-hadoop-cluster-network/172.19.0.3:8032
> 21/01/22 13:17:11 INFO client.AHSProxy: Connecting to Application History 
> server at master.docker-hadoop-cluster-network/172.19.0.3:10200
> Jan 22 13:17:11 We now have 2 NodeManagers up.
> ==========================================================================================
> === WARNING: This E2E Run took already 80% of the allocated time budget of 
> 250 minutes ===
> ==========================================================================================
> ======================================================================================================
> === WARNING: This E2E Run will time out in the next few minutes. Starting to 
> upload the log output ===
> ======================================================================================================
> ##[error]The task has timed out.
> Async Command Start: Upload Artifact
> Uploading 1 files
> File upload succeed.
> Upload '/tmp/_e2e_watchdog.output.0' to file container: 
> '#/11824779/e2e-timeout-logs'
> Associated artifact 140921 with build 12377
> Async Command End: Upload Artifact
> Async Command Start: Upload Artifact
> Uploading 1 files
> File upload succeed.
> Upload '/tmp/_e2e_watchdog.output.1' to file container: 
> '#/11824779/e2e-timeout-logs'
> Associated artifact 140921 with build 12377
> Async Command End: Upload Artifact
> Async Command Start: Upload Artifact
> Uploading 1 files
> File upload succeed.
> Upload '/tmp/_e2e_watchdog.output.2' to file container: 
> '#/11824779/e2e-timeout-logs'
> Associated artifact 140921 with build 12377
> Async Command End: Upload Artifact
> Finishing: Run e2e tests
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to