Steven Zhen Wu created FLINK-19816:
--------------------------------------
Summary: Flink restored from a wrong checkpoint (a very old one
and not the last completed one)
Key: FLINK-19816
URL: https://issues.apache.org/jira/browse/FLINK-19816
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.11.0
Reporter: Steven Zhen Wu
h2. Summary
Upon failure, it seems that Flink didn't restore from the last completed
checkpoint. Instead, it restored from a very old checkpoint. As a result, Kafka
offsets are invalid and caused the job to replay from the beginning as Kafka
consumer "auto.offset.reset" was set to "EARLIEST".
This is an embarrassingly parallel stateless job. Parallelism is over 1,000.
h2. Sequence of events from the logs
Just before the failure, checkpoint *210768* completed.
{code}
2020-10-25 02:35:05,970 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
[jobmanager-future-thread-5] - Completed checkpoint 210768 for job
233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
{code}
During restart, somehow it decided to restore from checkpoint *203531*.
{code:java}
2020-10-25 02:36:03,301 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
2020-10-25 02:36:03,302 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-5] - Recover all persisted job graphs.
2020-10-25 02:36:03,304 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
[cluster-io-thread-25] - Deleting path:
s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
2020-10-25 02:36:03,307 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-5] - Trying to recover job with job id
233b4938179c06974e4535ac8a868675.
2020-10-25 02:36:03,381 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
[cluster-io-thread-25] - Deleting path:
s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
...
2020-10-25 02:36:03,427 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
[flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from
ZooKeeper.
2020-10-25 02:36:03,432 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
[flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in ZooKeeper.
2020-10-25 02:36:03,432 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
[flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints
from storage.
2020-10-25 02:36:03,432 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
[flink-akka.actor.default-dispatcher-82003] - Starting job
233b4938179c06974e4535ac8a868675 from savepoint
s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
()
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)