[
https://issues.apache.org/jira/browse/FLINK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378299#comment-16378299
]
Xinyang Gao commented on FLINK-8770:
------------------------------------
For more information, I have rerun the test and see the following logs. Checkpoints
are still triggered every 1s, and Chaos Monkey kills the JobManager or a TaskManager
every 5 minutes.
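The actual job code is not part of this report; as a point of reference, here is a
minimal sketch of how such a job is typically configured, assuming the paths seen in
the logs below (the class name and the placeholder source are hypothetical; HA and
state.checkpoints.num-retained are flink-conf.yaml settings and are not shown):
{code:java}
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every second, as described above.
        env.enableCheckpointing(1000);

        // RocksDB state backend, checkpointing to the shared PVC mount seen in the logs.
        RocksDBStateBackend backend = new RocksDBStateBackend("file:/mnt/flink-test/checkpoints");
        backend.setDbStoragePath("/mnt/flink-test/rocksdb-data");
        env.setStateBackend(backend);

        // Placeholder pipeline; the real job reads from one Kafka topic and writes to another.
        env.fromElements(1, 2, 3).print();

        env.execute("passthroughjob-flink-test");
    }
}
{code}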
-----------------------
2018-02-23 18:03:24,990 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found
10 checkpoints in ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 10 checkpoints from storage. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,090 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12654. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,116 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12655. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,151 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12656. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,167 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12657. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,178 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12658. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,188 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12659. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,197 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12660. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,207 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12661. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,215 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12662. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,226 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12663. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,237 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for
90bc30051d4801a5b2c778f66a192563. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:03:25,246 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to
restore [flink-akka.actor.default-dispatcher-3]
At this point the recovery is fine; it then failed a few times to recover from the
checkpoint:
2018-02-23 18:03:35,392 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found
10 checkpoints in ZooKeeper. [jobmanager-future-thread-16]
2018-02-23 18:03:35,393 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 10 checkpoints from storage. [jobmanager-future-thread-16]
2018-02-23 18:03:35,393 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12654. [jobmanager-future-thread-16]
2018-02-23 18:03:35,398 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12655. [jobmanager-future-thread-16]
2018-02-23 18:03:35,404 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12656. [jobmanager-future-thread-16]
2018-02-23 18:03:35,409 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12657. [jobmanager-future-thread-16]
2018-02-23 18:03:35,414 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12658. [jobmanager-future-thread-16]
2018-02-23 18:03:35,423 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12659. [jobmanager-future-thread-16]
2018-02-23 18:03:35,433 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12660. [jobmanager-future-thread-16]
2018-02-23 18:03:35,444 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12661. [jobmanager-future-thread-16]
2018-02-23 18:03:35,456 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12662. [jobmanager-future-thread-16]
2018-02-23 18:03:35,472 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12663. [jobmanager-future-thread-16]
2018-02-23 18:03:35,490 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-16]
2018-02-23 18:03:35,492 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to
restore [jobmanager-future-thread-16]
It failed again at 2018-02-23 18:04:53,253, then finally succeeded:
2018-02-23 18:05:29,237 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 10 checkpoints from storage. [jobmanager-future-thread-16]
2018-02-23 18:05:29,237 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12654. [jobmanager-future-thread-16]
2018-02-23 18:05:29,243 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12655. [jobmanager-future-thread-16]
2018-02-23 18:05:29,246 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12656. [jobmanager-future-thread-16]
2018-02-23 18:05:29,249 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12657. [jobmanager-future-thread-16]
2018-02-23 18:05:29,251 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12658. [jobmanager-future-thread-16]
2018-02-23 18:05:29,260 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12659. [jobmanager-future-thread-16]
2018-02-23 18:05:29,263 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12660. [jobmanager-future-thread-16]
2018-02-23 18:05:29,267 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12661. [jobmanager-future-thread-16]
2018-02-23 18:05:29,278 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12662. [jobmanager-future-thread-16]
2018-02-23 18:05:29,282 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12663. [jobmanager-future-thread-16]
2018-02-23 18:05:29,285 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from
latest valid checkpoint: Checkpoint 12663 @ 1519408902914 for
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-16]
2018-02-23 18:05:29,286 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to
restore [jobmanager-future-thread-16]
More checkpoints are then triggered:
2018-02-23 18:05:48,710 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
checkpoint 12680 (75984 bytes in 349 ms).
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:05:49,332 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
checkpoint 12681 @ 1519409149326 [Checkpoint Timer]
2018-02-23 18:05:49,453 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
checkpoint 12681 (75984 bytes in 99 ms). [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:05:50,337 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
checkpoint 12682 @ 1519409150326 [Checkpoint Timer]
2018-02-23 18:05:50,473 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
checkpoint 12682 (75984 bytes in 123 ms).
[flink-akka.actor.default-dispatcher-22]
Then, when the JobManager is restarted, it tries to retrieve the newer
checkpoints:
2018-02-23 18:06:43,361 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 10 checkpoints from storage. [jobmanager-future-thread-9]
2018-02-23 18:06:43,361 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12675. [jobmanager-future-thread-9]
2018-02-23 18:06:43,373 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12676. [jobmanager-future-thread-9]
2018-02-23 18:06:43,378 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12677. [jobmanager-future-thread-9]
2018-02-23 18:06:43,383 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12678. [jobmanager-future-thread-9]
2018-02-23 18:06:43,390 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12679. [jobmanager-future-thread-9]
2018-02-23 18:06:43,402 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12680. [jobmanager-future-thread-9]
2018-02-23 18:06:43,407 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12681. [jobmanager-future-thread-9]
2018-02-23 18:06:43,414 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12682. [jobmanager-future-thread-9]
2018-02-23 18:06:43,419 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12683. [jobmanager-future-thread-9]
2018-02-23 18:06:43,427 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12684. [jobmanager-future-thread-9]
2018-02-23 18:06:43,433 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from
latest valid checkpoint: Checkpoint 12684 @ 1519409152326 for
90bc30051d4801a5b2c778f66a192563. [jobmanager-future-thread-9]
2018-02-23 18:06:43,434 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to
restore [jobmanager-future-thread-9]
Then I see the following exceptions:
Caused by: java.util.concurrent.ExecutionException:
java.io.IOException: Could not flush and close the file system output stream to
file
Caused by: java.io.IOException: No space left on device
Could not materialize checkpoint 12691 for operator Source
java.lang.IllegalStateException: Could not initialize keyed
state backend.
Caused by: java.lang.IllegalArgumentException: Illegal value
provided for SubCode.
Then somehow it tries to retrieve the old checkpoints:
2018-02-23 18:08:10,639 WARN
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a
Metric with the name 'totalNumberOfCheckpoints'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'numberOfInProgressCheckpoints'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'numberOfCompletedCheckpoints'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'numberOfFailedCheckpoints'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'lastCheckpointRestoreTimestamp'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name 'lastCheckpointSize'.
Metric will not be reported.[flink-test-jobmanager, jobmanager,
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'lastCheckpointDuration'. Metric will not be reported.[flink-test-jobmanager,
jobmanager, passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'lastCheckpointAlignmentBuffered'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name
'lastCheckpointExternalPath'. Metric will not be
reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 INFO org.apache.flink.runtime.jobmanager.JobManager -
Loading state backend via factory
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,640 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory - State
backend is set to RocksDB (configured DB storage paths
[/mnt/flink-test/rocksdb-data], checkpoints to filesystem
file:/mnt/flink-test/checkpoints ) [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,658 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name 'restartingTime'.
Metric will not be reported.[flink-test-jobmanager, jobmanager,
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name 'downtime'. Metric
will not be reported.[flink-test-jobmanager, jobmanager,
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name 'uptime'. Metric will
not be reported.[flink-test-jobmanager, jobmanager, passthroughjob-flink-test]
[flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 WARN org.apache.flink.metrics.MetricGroup - Name
collision: Group already contains a Metric with the name 'fullRestarts'. Metric
will not be reported.[flink-test-jobmanager, jobmanager,
passthroughjob-flink-test] [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,659 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found
10 checkpoints in ZooKeeper. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 10 checkpoints from storage. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,838 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 12654. [flink-akka.actor.default-dispatcher-3]
2018-02-23 18:08:10,857 WARN
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Could
not retrieve checkpoint, not adding to list of recovered checkpoints.
[flink-akka.actor.default-dispatcher-3]
org.apache.flink.util.FlinkException: Could not retrieve checkpoint 12654 from
state handle under /0000000000000012654. This indicates that the retrieved
state handle is broken. Try cleaning the state handle store.
at
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:410)
at
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.recover(ZooKeeperCompletedCheckpointStore.java:192)
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1061)
at
org.apache.flink.runtime.executiongraph.ExecutionGraph.restoreLatestCheckpointedState(ExecutionGraph.java:1215)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1339)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1336)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1336)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.FileNotFoundException:
/mnt/flink-test/recovery/completedCheckpointb2aae161b3e9 (No such file or
directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
at
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
at
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:70)
at
org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
at
org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
at
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:404)
... 14 more
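For context, here is a rough, simplified sketch (not the actual Flink source; the
class and helper names below are hypothetical) of what the recovery path in the
stack trace above does: ZooKeeper only stores small pointers (state handles) to
completedCheckpoint files on the shared file system, so if a referenced file has
already been deleted, retrieval fails with exactly this FileNotFoundException and
the checkpoint is skipped during recovery.
{code:java}
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

public class RecoverySketch {

    // Stand-in for what a znode under flink/checkpoints stores: only a file path.
    static class CheckpointPointer {
        final long checkpointId;
        final String filePath; // e.g. /mnt/flink-test/recovery/completedCheckpointXXXX
        CheckpointPointer(long id, String path) { this.checkpointId = id; this.filePath = path; }
    }

    // Mimics ZooKeeperCompletedCheckpointStore.recover(): resolve every pointer,
    // skipping the ones whose backing file has already been removed.
    static List<Long> recover(List<CheckpointPointer> pointersFromZooKeeper) {
        List<Long> recovered = new ArrayList<>();
        for (CheckpointPointer p : pointersFromZooKeeper) {
            try {
                recovered.add(retrieve(p));
            } catch (FileNotFoundException e) {
                // The "Could not retrieve checkpoint ... state handle is broken" case:
                // the znode is still there, but the completedCheckpoint file is gone.
                System.err.println("Skipping checkpoint " + p.checkpointId + ": " + e.getMessage());
            }
        }
        return recovered; // if this ends up empty, there is nothing to restore from
    }

    // Mimics RetrievableStreamStateHandle.retrieveState() / FileStateHandle.openInputStream().
    static long retrieve(CheckpointPointer p) throws FileNotFoundException {
        if (!new File(p.filePath).exists()) {
            throw new FileNotFoundException(p.filePath + " (No such file or directory)");
        }
        return p.checkpointId; // real code would deserialize the CompletedCheckpoint here
    }

    public static void main(String[] args) {
        List<CheckpointPointer> stale = new ArrayList<>();
        // Hypothetical data: ZooKeeper still lists checkpoint 12654, but its file was deleted.
        stale.add(new CheckpointPointer(12654,
                "/mnt/flink-test/recovery/completedCheckpointb2aae161b3e9"));
        System.out.println("Recovered: " + recover(stale));
    }
}
{code}
The hypothetical path in main() mirrors the missing
/mnt/flink-test/recovery/completedCheckpointb2aae161b3e9 file reported above.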
> CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager
> is restarted it fails to recover the job due to "checkpoint FileNotFound
> exception"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-8770
> URL: https://issues.apache.org/jira/browse/FLINK-8770
> Project: Flink
> Issue Type: Bug
> Components: Local Runtime
> Affects Versions: 1.4.0
> Reporter: Xinyang Gao
> Priority: Blocker
> Fix For: 1.5.0
>
>
> Hi, I am running a Flink cluster (1 JobManager + 6 TaskManagers) with HA mode
> on OpenShift. I have enabled Chaos Monkey, which kills either the JobManager or
> one of the TaskManagers every 5 minutes; the ZooKeeper quorum is stable, with no
> chaos monkey enabled on it. Flink reads data from one Kafka topic and writes data
> into another Kafka topic. Checkpointing is enabled, with a 1000 ms interval, and
> state.checkpoints.num-retained is set to 10. I am using a PVC for the state
> backend directories (checkpoints, recovery, etc.), so the checkpoints and states
> are persistent. The restart strategy for the Flink JobManager DeploymentConfig is
> recreate, which means it kills the old JobManager container before it starts the
> new one.
> I first ran the chaos test for one day; however, I have seen the exception:
> org.apache.flink.util.FlinkException: Could not retrieve
> checkpoint *** from state handle under /***. This indicates that the
> retrieved state handle is broken. Try cleaning the state handle store.
> The root cause is a checkpoint FileNotFound error.
> The Flink job then keeps restarting for a few hours and, due to the above
> error, it cannot be restarted successfully.
> After further investigation, I have found the following facts in my PVC:
>
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10
> completedCheckpoint0ee95157de00
> -rw-r--r--. 1 flink root 11379 Feb 23 01:51 completedCheckpoint498d0952cf00
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint650fe5b021fe
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint66634149683e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint67f24c3b018e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint6f64ebf0ae64
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint906ebe1fb337
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint98b79ea14b09
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpointa0d1070e0b6c
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpointbd3a9ba50322
> -rw-r--r--. 1 flink root 11355 Feb 22 17:31 completedCheckpointd433b5e108be
> -rw-r--r--. 1 flink root 11379 Feb 22 22:56 completedCheckpointdd0183ed092b
> -rw-r--r--. 1 flink root 11379 Feb 22 00:00 completedCheckpointe0a5146c3d81
> -rw-r--r--. 1 flink root 11331 Feb 22 17:06 completedCheckpointec82f3ebc2ad
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11
> completedCheckpointf86e460f6720
>
> The latest 10 checkpoints were created at about 02:10, if you ignore the old
> checkpoints that were not deleted successfully (which I do not care much
> about).
>
> However, when checking ZooKeeper, I see the following in the flink/checkpoints
> path (I only list one node, but the other 9 are similar):
> cZxid = 0x160001ff5d
> (node data: a serialized org.apache.flink.runtime.state.RetrievableStreamStateHandle
> wrapping an org.apache.flink.runtime.state.filesystem.FileStateHandle whose
> filePath is file:/mnt/flink-test/recovery/completedCheckpointd004a3753870)
> ctime = Fri Feb 23 02:08:18 UTC 2018
> mZxid = 0x160001ff5d
> mtime = Fri Feb 23 02:08:18 UTC 2018
> pZxid = 0x1d00000c6d
> cversion = 31
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 492
>
> So the latest completedCheckpoint handles stored in ZooKeeper are from about
> 02:08, which implies that the checkpoints completed at 02:10 somehow were not
> successfully submitted to ZooKeeper. When it tries to restart the Flink job,
> it is not able to find the latest checkpoint files and therefore fails.
> I am very surprised by this, since writing the checkpoint handle to ZooKeeper
> appears to be [synchronous
> |https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L222],
> so I am not sure why this happens. Can anyone help look into this?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)