[
https://issues.apache.org/jira/browse/BEAM-10955?focusedWorklogId=620300&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-620300
]
ASF GitHub Bot logged work on BEAM-10955:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jul/21 01:02
Start Date: 08/Jul/21 01:02
Worklog Time Spent: 10m
Work Description: ibzib commented on a change in pull request #15127:
URL: https://github.com/apache/beam/pull/15127#discussion_r665798660
##########
File path: runners/flink/src/test/java/org/apache/beam/runners/flink/FlinkSavepointTest.java
##########
@@ -159,6 +157,9 @@ private void runSavepointAndRestore(boolean isPortablePipeline) throws Exception
// Initial parallelism
options.setParallelism(2);
options.setRunner(FlinkRunner.class);
+ // Enable checkpointing interval for streaming non portable pipeline to avoid
Review comment:
> I'm not sure about this, but when I set a checkpointing interval for a
> portable pipeline, it shows a timeout error like in
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3819/testReport/org.apache.beam.runners.flink/FlinkSavepointTest/testSavepointRestorePortable/.
I'm not sure which error you are referring to. If the test passed, the error
is likely benign.
> The reason behind this fix is to enable restart after some job failure.
> When this test fails, it continuously shows the error: "Recovery is
> suppressed by NoRestartBackoffTimeStrategy" like in
> https://scans.gradle.com/s/n2coqujl4jc7i/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy?top-execution=1.
Thanks for getting the build scan. It looks like something is going wrong
while taking the savepoint. This could be a real bug, so let's wait to merge
this until we are sure that's not the case.
```
Caused by: java.lang.Exception: Could not materialize checkpoint 2 for operator VerificationStage/ParMultiDo(Anonymous) (1/2)#0.
    at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:257)
    ... 4 more
Caused by: java.lang.IllegalArgumentException
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
    at org.apache.flink.runtime.checkpoint.CheckpointMetrics.<init>(CheckpointMetrics.java:74)
    at org.apache.flink.runtime.checkpoint.CheckpointMetricsBuilder.build(CheckpointMetricsBuilder.java:135)
    at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportCompletedSnapshotStates(AsyncCheckpointRunnable.java:206)
    at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:158)
    ... 3 more
```
It looks like the failed precondition is the check on `alignmentDurationNanos`.
However, I'm not sure what the unacceptable value is or where it is coming
from.
https://github.com/apache/flink/blob/3909c9f0a11e8b38b264db9e7716fb41e75cc524/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointMetrics.java#L74
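For illustration only, here is a minimal sketch of how a message-less argument check produces a bare `java.lang.IllegalArgumentException` like the one in the trace above. This is not Flink's code: the class `AlignmentPreconditionSketch`, its helper methods, and the assumed lower bound of `-1` are hypothetical stand-ins, and the real bound should be read from the linked source line.

```java
// Hypothetical stand-in for a metrics object whose constructor validates its input.
// Nothing here is taken from Flink; names and the lower bound are assumptions.
final class AlignmentPreconditionSketch {
    // Assumed lower bound for illustration; the actual accepted range is unknown here.
    static final long ASSUMED_MIN_NANOS = -1L;

    private final long alignmentDurationNanos;

    AlignmentPreconditionSketch(long alignmentDurationNanos) {
        // A message-less check: on failure the exception has no detail message,
        // which matches a trace line reading only "java.lang.IllegalArgumentException".
        checkArgument(alignmentDurationNanos >= ASSUMED_MIN_NANOS);
        this.alignmentDurationNanos = alignmentDurationNanos;
    }

    long getAlignmentDurationNanos() {
        return alignmentDurationNanos;
    }

    static void checkArgument(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException();
        }
    }

    // Returns true if constructing the object with this value fails the check.
    static boolean rejects(long nanos) {
        try {
            new AlignmentPreconditionSketch(nanos);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    // Returns the detail message of the failure, or "accepted" if the value passes.
    static String failureMessage(long nanos) {
        try {
            new AlignmentPreconditionSketch(nanos);
            return "accepted";
        } catch (IllegalArgumentException e) {
            return e.getMessage(); // null for a message-less check
        }
    }
}
```

Under these assumptions, a value below the bound is rejected with no detail message, so the trace alone cannot tell us which value was passed; logging the value before the metrics object is built would be one way to find out.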
Issue Time Tracking
-------------------
Worklog Id: (was: 620300)
Time Spent: 8h 10m (was: 8h)
> Flink Java Runner test flake: Could not find Flink job
> -------------------------------------------------------
>
> Key: BEAM-10955
> URL: https://issues.apache.org/jira/browse/BEAM-10955
> Project: Beam
> Issue Type: Bug
> Components: runner-flink
> Reporter: Valentyn Tymofieiev
> Assignee: Benjamin Gonzalez
> Priority: P1
> Labels: currently-failing, flake
> Time Spent: 8h 10m
> Remaining Estimate: 0h
>
> Opening to track how frequent this is. Observed on
> org.apache.beam.runners.flink.FlinkSavepointTest.testSavepointRestoreLegacy
> in https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/13716
> {noformat}
> java.util.concurrent.ExecutionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (6eddfd5820521f0898920a538c4a82dd)
> Stacktrace
> java.util.concurrent.ExecutionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (6eddfd5820521f0898920a538c4a82dd)
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.beam.runners.flink.FlinkSavepointTest.takeSavepointAndCancelJob(FlinkSavepointTest.java:253)
> at org.apache.beam.runners.flink.FlinkSavepointTest.runSavepointAndRestore(FlinkSavepointTest.java:172)
> at org.apache.beam.runners.flink.FlinkSavepointTest.testSavepointRestoreLegacy(FlinkSavepointTest.java:144)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (6eddfd5820521f0898920a538c4a82dd)
> at org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:803)
> at org.apache.flink.runtime.dispatcher.Dispatcher.triggerSavepoint(Dispatcher.java:582)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:189)
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Standard Output
> Shutting SDK harness down.
> Shutting SDK harness down.
> Standard Error
> [Test worker] INFO org.apache.beam.runners.flink.FlinkSavepointTest - Savepoints will be written to file:///tmp/junit5733722483577894112
> [Test worker] INFO org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils - The configuration option Key: 'taskmanager.cpu.cores' , default: null (fallback keys: []) required for local execution is not set, setting it to its default value 1.7976931348623157E308
> [Test worker] INFO org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils - The con
> ...[truncated 527576 chars]...
> flink-metrics-2] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
> [flink-metrics-2] INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:46101
> [flink-metrics-2] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
> {noformat}