[ 
https://issues.apache.org/jira/browse/BEAM-10955?focusedWorklogId=661899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-661899
 ]

ASF GitHub Bot logged work on BEAM-10955:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Oct/21 20:38
            Start Date: 07/Oct/21 20:38
    Worklog Time Spent: 10m 
      Work Description: benWize commented on pull request #15664:
URL: https://github.com/apache/beam/pull/15664#issuecomment-938138029


   Hi @ibzib, could you help me review this? 
   I think the cause is that `triggerSavepoint` in 
   
https://github.com/apache/beam/blob/master/runners/flink/src/test/java/org/apache/beam/runners/flink/FlinkSavepointTest.java#L259
    is called twice, as shown in the logs here 
https://scans.gradle.com/s/tdagq66c7f4n2/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy/1/output
 (the log message `Triggering cancel-with-savepoint` appears twice). 
   The first pass through the loop tries to trigger the savepoint and cancel the job. 
If that call throws an exception, a second pass tries to trigger the savepoint 
while some tasks are being canceled, which fails with the message 
`Failed to trigger checkpoint for job xxx since Checkpoint triggering task 
Source: Impulse (1/1) of job xxx is not being executed at the moment. Aborting 
checkpoint. Failure reason: Not all required tasks are currently running.`
   I was able to reproduce the error shown in the logs locally by calling 
`triggerSavepoint` twice. 
   My proposed fix is to split trigger and cancel so that the job is not 
canceled if the savepoint operation fails on its first call.
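
   A minimal sketch of the split described above, not the actual change in the 
pull request. `JobControl`, `takeSavepointAndCancel`, and their parameters are 
hypothetical names introduced for illustration; they stand in for whatever 
cluster handle the test uses and are not Flink or Beam APIs. The idea shown is 
that only the savepoint call is retried, and cancel is issued once, after a 
savepoint has succeeded.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class SavepointThenCancel {

  /** Hypothetical stand-in for the cluster handle; not a Flink or Beam interface. */
  public interface JobControl {
    CompletableFuture<String> triggerSavepoint(String targetDirectory);

    CompletableFuture<Void> cancel();
  }

  /**
   * Retries only the savepoint. The job is canceled once, and only after a
   * savepoint path has been obtained, so a retry never runs against a job
   * whose tasks are already being canceled.
   */
  public static String takeSavepointAndCancel(
      JobControl job, String targetDirectory, int maxAttempts, long retryDelayMillis) {
    String savepointPath = null;
    Exception lastFailure = null;
    try {
      for (int attempt = 1; attempt <= maxAttempts && savepointPath == null; attempt++) {
        try {
          // Savepoint only: the job keeps running if this attempt fails.
          savepointPath = job.triggerSavepoint(targetDirectory).get();
        } catch (ExecutionException e) {
          lastFailure = e;
          Thread.sleep(retryDelayMillis);
        }
      }
      if (savepointPath == null) {
        // Never cancel without a savepoint to restore from.
        throw new IllegalStateException(
            "Savepoint failed after " + maxAttempts + " attempts", lastFailure);
      }
      // Cancel is a separate step, reached only after a successful savepoint.
      job.cancel().get();
      return savepointPath;
    } catch (ExecutionException e) {
      throw new IllegalStateException("Cancel failed after savepoint " + savepointPath, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for the job", e);
    }
  }
}
```

   With this shape, a failed first savepoint attempt leaves the job running, so 
the retry does not hit the `Not all required tasks are currently running` 
condition described above.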
   



Issue Time Tracking
-------------------

    Worklog Id:     (was: 661899)
    Time Spent: 16h 10m  (was: 16h)

> Flink Java Runner test flake: Could not find Flink job 
> (FlinkJobNotFoundException)
> ----------------------------------------------------------------------------------
>
>                 Key: BEAM-10955
>                 URL: https://issues.apache.org/jira/browse/BEAM-10955
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Valentyn Tymofieiev
>            Assignee: Benjamin Gonzalez
>            Priority: P1
>              Labels: currently-failing, flake
>          Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> Opening to track how frequent this is. Observed on
> org.apache.beam.runners.flink.FlinkSavepointTest.testSavepointRestoreLegacy
> in https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/13716
> {noformat}
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (6eddfd5820521f0898920a538c4a82dd)
> Stacktrace
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (6eddfd5820521f0898920a538c4a82dd)
>       at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at 
> org.apache.beam.runners.flink.FlinkSavepointTest.takeSavepointAndCancelJob(FlinkSavepointTest.java:253)
>       at 
> org.apache.beam.runners.flink.FlinkSavepointTest.runSavepointAndRestore(FlinkSavepointTest.java:172)
>       at 
> org.apache.beam.runners.flink.FlinkSavepointTest.testSavepointRestoreLegacy(FlinkSavepointTest.java:144)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (6eddfd5820521f0898920a538c4a82dd)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:803)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.triggerSavepoint(Dispatcher.java:582)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:189)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>       at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>       at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Standard Output
> Shutting SDK harness down.
> Shutting SDK harness down.
> Standard Error
> [Test worker] INFO org.apache.beam.runners.flink.FlinkSavepointTest - 
> Savepoints will be written to file:///tmp/junit5733722483577894112
> [Test worker] INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils - The 
> configuration option Key: 'taskmanager.cpu.cores' , default: null (fallback 
> keys: []) required for local execution is not set, setting it to its default 
> value 1.7976931348623157E308
> [Test worker] INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils - The con
> ...[truncated 527576 chars]...
> flink-metrics-2] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - 
> Stopped Akka RPC service.
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.PermanentBlobCache - 
> Shutting down BLOB cache
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.TransientBlobCache - 
> Shutting down BLOB cache
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.BlobServer - Stopped 
> BLOB server at 0.0.0.0:46101
> [flink-metrics-2] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - 
> Stopped Akka RPC service.
> {noformat}



