[ 
https://issues.apache.org/jira/browse/FLINK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378299#comment-16378299
 ] 

Xinyang Gao commented on FLINK-8770:
------------------------------------

For more information, I have rerun the test and see the following logs, still 
checkpoint is triggered every 1s, and chaos monkey kills jm/tm in every 5 mins

-----------------------

2018-02-23 18:03:24,990 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
Recovering checkpoints from ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 
10 checkpoints in ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to fetch 10 checkpoints from storage. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12654. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,116 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12655. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,151 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12656. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,167 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12657. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,178 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12658. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,188 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12659. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,197 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12660. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,207 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12661. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,215 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12662. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,226 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12663. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,237 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from 
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for 
90bc30051d4801a5b2c778f66a192563. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,246 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to 
restore [flink-akka.actor.default-dispatcher-3]



{color:#FF0000}at this time it is good, then it failed a few time to recover 
from the checkpoint {color}

{color:#333333}2018-02-23 18:03:35,392 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 
10 checkpoints in ZooKeeper. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,393 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to fetch 10 checkpoints from storage. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,393 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12654. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,398 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12655. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,404 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12656. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,409 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12657. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,414 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12658. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,423 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12659. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,433 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12660. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,444 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12661. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,456 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12662. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,472 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12663. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,490 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from 
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for 
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-16]{color}
{color:#333333}2018-02-23 18:03:35,492 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to 
restore [jobmanager-future-thread-16]{color}

{color:#d04437}again at 2018-02-23 18:04:53,253, then finally succeed:{color}

2018-02-23 18:05:29,237 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to fetch 10 checkpoints from storage. [jobmanager-future-thread-16]

{color:#333333}2018-02-23 18:05:29,237 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12654. [jobmanager-future-thread-16]
2018-02-23 18:05:29,243 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12655. [jobmanager-future-thread-16]
2018-02-23 18:05:29,246 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12656. [jobmanager-future-thread-16]
2018-02-23 18:05:29,249 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12657. [jobmanager-future-thread-16]
2018-02-23 18:05:29,251 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12658. [jobmanager-future-thread-16]
2018-02-23 18:05:29,260 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12659. [jobmanager-future-thread-16]
2018-02-23 18:05:29,263 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12660. [jobmanager-future-thread-16]
2018-02-23 18:05:29,267 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12661. [jobmanager-future-thread-16]
2018-02-23 18:05:29,278 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12662. [jobmanager-future-thread-16]
2018-02-23 18:05:29,282 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12663. [jobmanager-future-thread-16]
2018-02-23 18:05:29,285 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from 
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for 
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-16]
2018-02-23 18:05:29,286 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to 
restore [jobmanager-future-thread-16]{color}

{color:#d04437}more checkpoints are triggered:{color}

 

{color:#333333}2018-02-23 18:05:48,710 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 12680 (75984 bytes in 349 ms). 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:05:49,332 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 12681 @ 1519409149326 [Checkpoint Timer]
2018-02-23 18:05:49,453 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 12681 (75984 bytes in 99 ms). [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:05:50,337 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 12682 @ 1519409150326 [Checkpoint Timer]
2018-02-23 18:05:50,473 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 12682 (75984 bytes in 123 ms). 
[flink-akka.actor.default-dispatcher-22]{color}

{color:#d04437}then when Jobmanager is restarted, it tries to retrieve the more 
latest checkpoint:{color}

{color:#333333}2018-02-23 18:06:43,361 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to fetch 10 checkpoints from storage. [jobmanager-future-thread-9]
2018-02-23 18:06:43,361 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12675. [jobmanager-future-thread-9]
2018-02-23 18:06:43,373 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12676. [jobmanager-future-thread-9]
2018-02-23 18:06:43,378 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12677. [jobmanager-future-thread-9]
2018-02-23 18:06:43,383 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12678. [jobmanager-future-thread-9]
2018-02-23 18:06:43,390 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12679. [jobmanager-future-thread-9]
2018-02-23 18:06:43,402 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12680. [jobmanager-future-thread-9]
2018-02-23 18:06:43,407 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12681. [jobmanager-future-thread-9]
2018-02-23 18:06:43,414 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12682. [jobmanager-future-thread-9]
2018-02-23 18:06:43,419 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12683. [jobmanager-future-thread-9]
2018-02-23 18:06:43,427 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12684. [jobmanager-future-thread-9]
2018-02-23 18:06:43,433 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from 
latest valid checkpoint: Checkpoint 12684 @ 1519409152326 for 
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-9]
2018-02-23 18:06:43,434 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to 
restore [jobmanager-future-thread-9]{color}

 

{color:#d04437}then I see the following exception:{color}

{color:#333333}Caused by: java.util.concurrent.ExecutionException: 
java.io.IOException: Could not flush and close the file system output stream to 
file{color}

{color:#333333} Caused by: java.io.IOException: No space left on device{color}

{color:#333333}Could not materialize checkpoint 12691 for operator Source{color}

{color:#333333}java.lang.IllegalStateException: Could not initialize keyed 
state backend.{color}

{color:#333333}Caused by: java.lang.IllegalArgumentException: Illegal value 
provided for SubCode.{color}

 

{color:#d04437}then somehow it tries to retrieve the old checkpoint {color}

 

{color:#333333}2018-02-23 18:08:10,639 WARN 
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a 
Metric with the name 'totalNumberOfCheckpoints'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'numberOfInProgressCheckpoints'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'numberOfCompletedCheckpoints'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'numberOfFailedCheckpoints'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'lastCheckpointRestoreTimestamp'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 'lastCheckpointSize'. 
Metric will not be reported.[flink-test-jobmanager, jobmanager, 
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'lastCheckpointDuration'. Metric will not be reported.[flink-test-jobmanager, 
jobmanager, passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'lastCheckpointAlignmentBuffered'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 
'lastCheckpointExternalPath'. Metric will not be 
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 INFO org.apache.flink.runtime.jobmanager.JobManager - 
Loading state backend via factory 
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 INFO 
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory - State 
backend is set to RocksDB (configured DB storage paths 
[/mnt/flink-test/rocksdb-data], checkpoints to filesystem 
file:/mnt/flink-test/checkpoints ) [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,658 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 'restartingTime'. 
Metric will not be reported.[flink-test-jobmanager, jobmanager, 
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 'downtime'. Metric 
will not be reported.[flink-test-jobmanager, jobmanager, 
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 'uptime'. Metric will 
not be reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test] 
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name 
collision: Group already contains a Metric with the name 'fullRestarts'. Metric 
will not be reported.[flink-test-jobmanager, jobmanager, 
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
Recovering checkpoints from ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 
10 checkpoints in ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to fetch 10 checkpoints from storage. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying 
to retrieve checkpoint 12654. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,857 WARN 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Could 
not retrieve checkpoint, not adding to list of recovered checkpoints. 
[flink-akka.actor.default-dispatcher-3]
org.apache.flink.util.FlinkException: Could not retrieve checkpoint 12654 from 
state handle under /0000000000000012654. This indicates that the retrieved 
state handle is broken. Try cleaning the state handle store.
 at 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:410)
 at 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.recover(ZooKeeperCompletedCheckpointStore.java:192)
 at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1061)
 at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.restoreLatestCheckpointedState(ExecutionGraph.java:1215)
 at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1339)
 at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1336)
 at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1336)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
 at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.FileNotFoundException: 
/mnt/flink-test/recovery/completedCheckpointb2aae161b3e9 (No such file or 
directory)
 at java.io.FileInputStream.open0(Native Method)
 at java.io.FileInputStream.open(FileInputStream.java:195)
 at java.io.FileInputStream.<init>(FileInputStream.java:138)
 at 
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
 at 
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
 at 
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:70)
 at 
org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
 at 
org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
 at 
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:404)
 ... 14 more{color}

 

 

 

 

 

 

 

 

 

 

 

> CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager 
> is restarted it fails to recover the job due to "checkpoint FileNotFound 
> exception"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8770
>                 URL: https://issues.apache.org/jira/browse/FLINK-8770
>             Project: Flink
>          Issue Type: Bug
>          Components: Local Runtime
>    Affects Versions: 1.4.0
>            Reporter: Xinyang Gao
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> Hi, I am running a Flink cluster (1 JobManager + 6 TaskManagers) with HA mode 
> on OpenShift, I have enabled Chaos Monkey which kills either JobManager or 
> one of the TaskManager in every 5 minutes, ZooKeeper quorum is stable with no 
> chaos monkey enabled. Flink reads data from one Kafka topic and writes data 
> into another Kafka topic. Checkpoint surely is enabled, with 1000ms interval. 
> state.checkpoints.num-retained is set to 10. I am using PVC for state backend 
> (checkpoint, recovery, etc), so the checkpoints and states are persistent. 
> The restart strategy for Flink jobmanager DeploymentConfig is 
> {color:#d04437}recreate, {color:#333333}which means it will kill the old 
> container of jobmanager before it restarts the new one.{color}{color}
> I have run the Chaos test for one day at first, however I have seen the 
> exception:
> {color:#FF0000}org.apache.flink.util.FlinkException: Could not retrieve 
> checkpoint *** from state handle under /***. This indicates that the 
> retrieved state handle is broken. Try cleaning the state handle store. 
> {color:#333333}and the root cause is checkpoint 
> {color:#d04437}FileNotFound{color}. {color}{color}
> {color:#FF0000}{color:#333333}then the Flink job keeps restarting for a few 
> hours and due to the above error it cannot be restarted successfully. 
> {color}{color}
> {color:#FF0000}{color:#333333}After further investigation, I have found the 
> following facts in my PVC:{color}{color}
>  
> {color:#d04437}-rw-r--r--. 1 flink root 11379 Feb 23 02:10 
> completedCheckpoint0ee95157de00
> -rw-r--r--. 1 flink root 11379 Feb 23 01:51 completedCheckpoint498d0952cf00
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint650fe5b021fe
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint66634149683e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint67f24c3b018e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint6f64ebf0ae64
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint906ebe1fb337
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint98b79ea14b09
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpointa0d1070e0b6c
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpointbd3a9ba50322
> -rw-r--r--. 1 flink root 11355 Feb 22 17:31 completedCheckpointd433b5e108be
> -rw-r--r--. 1 flink root 11379 Feb 22 22:56 completedCheckpointdd0183ed092b
> -rw-r--r--. 1 flink root 11379 Feb 22 00:00 completedCheckpointe0a5146c3d81
> -rw-r--r--. 1 flink root 11331 Feb 22 17:06 completedCheckpointec82f3ebc2ad
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 
> completedCheckpointf86e460f6720{color}
>  
> {color:#333333}The latest 10 checkpoints are created at about 02:10, if you 
> ignore the old checkpoints which were not deleted successfully (which I do 
> not care too much).{color}
>  
> {color:#333333}However when checking on ZooKeeper, I see the followings in 
> flink/checkpoints path (I only list one, but the other 9 are similar){color}
> {color:#d04437}cZxid = 0x160001ff5d
> ��sr;org.apache.flink.runtime.state.RetrievableStreamStateHandle�U�+LwrappedStreamStateHandlet2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr9org.apache.flink.runtime.state.filesystem.FileStateHandle�u�b�▒▒J
>  
> stateSizefilePathtLorg/apache/flink/core/fs/Path;xp,ssrorg.apache.flink.core.fs.PathLuritLjava/net/URI;xpsr
>  
> java.net.URI�x.C�I�LstringtLjava/lang/String;xpt=file:/mnt/flink-test/recovery/completedCheckpointd004a3753870x
> [zk: localhost:2181(CONNECTED) 7] ctime = Fri Feb 23 02:08:18 UTC 2018
> mZxid = 0x160001ff5d
> mtime = Fri Feb 23 02:08:18 UTC 2018
> pZxid = 0x1d00000c6d
> cversion = 31
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 492{color}
> {color:#FF0000}{color:#333333} {color}{color}
> so the latest completedCheckpoints status stored on ZooKeeper is at about 
> {color:#d04437}02:08, {color:#333333}which implies that the completed 
> checkpoints at{color}{color} {color:#d04437}02:10 {color:#333333}somehow are 
> not successfully submitted to ZooKpeer, so when it tries to restart the Flink 
> job it is not able to find the latest checkpoint thus being 
> failed.{color}{color}
> {color:#d04437}{color:#333333}I am very suprised by this since seems writing 
> checkpoint to zookeeper is [synchronous 
> |https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L222],
>  so I am not sure why this happens. Can anyone help looks at this 
> ?{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to