ibzib commented on a change in pull request #15127:
URL: https://github.com/apache/beam/pull/15127#discussion_r665798660



##########
File path: 
runners/flink/src/test/java/org/apache/beam/runners/flink/FlinkSavepointTest.java
##########
@@ -159,6 +157,9 @@ private void runSavepointAndRestore(boolean 
isPortablePipeline) throws Exception
     // Initial parallelism
     options.setParallelism(2);
     options.setRunner(FlinkRunner.class);
+    // Enable checkpointing interval for streaming non portable pipeline to 
avoid

Review comment:
       > I'm not sure about this, but when I set a checkpointing interval for a 
portable pipeline, it shows a timeout error like in 
https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3819/testReport/org.apache.beam.runners.flink/FlinkSavepointTest/testSavepointRestorePortable/.
   
   I'm not sure which error you are talking about? If the test passed, it's 
likely it's benign.
   
   > The reason behind this fix is to enable restart after some job failure.
   > When this test fails, continuously shows the error: "Recovery is 
suppressed by NoRestartBackoffTimeStrategy" like in 
https://scans.gradle.com/s/n2coqujl4jc7i/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy?top-execution=1.
   
   Thanks for getting the build scan. It looks like something is going wrong 
while taking the savepoint. It looks like it could be a real bug, so let's wait 
to merge this until we are sure that's not the case.
   
   ```
   Caused by: java.lang.Exception: Could not materialize checkpoint 2 for 
operator VerificationStage/ParMultiDo(Anonymous) (1/2)#0. |  
     | at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:257)
 |  
     | ... 4 more |  
     | Caused by: java.lang.IllegalArgumentException |  
     | at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122) |  
     | at 
org.apache.flink.runtime.checkpoint.CheckpointMetrics.<init>(CheckpointMetrics.java:74)
 |  
     | at 
org.apache.flink.runtime.checkpoint.CheckpointMetricsBuilder.build(CheckpointMetricsBuilder.java:135)
 |  
     | at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportCompletedSnapshotStates(AsyncCheckpointRunnable.java:206)
 |  
     | at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:158)
 |  
     | ... 3 more
   ```
   
   It looks like the failed precondition is checking `alignmentDurationNanos`. 
I'm not sure however what the unacceptable value is, nor where it is coming 
from. 
https://github.com/apache/flink/blob/3909c9f0a11e8b38b264db9e7716fb41e75cc524/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointMetrics.java#L74




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to