[
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408693#comment-17408693
]
Paul Lin commented on FLINK-19816:
----------------------------------
Still getting this error with Flink 1.12.3. The jobmanager logs are attached,
please take a look. [~trohrmann] [^jm.log]
> Flink restored from a wrong checkpoint (a very old one and not the last
> completed one)
> --------------------------------------------------------------------------------------
>
> Key: FLINK-19816
> URL: https://issues.apache.org/jira/browse/FLINK-19816
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Steven Zhen Wu
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.11.3, 1.12.0
>
> Attachments: jm.log
>
>
> h2. Summary
> Upon failure, it seems that Flink didn't restore from the last completed
> checkpoint. Instead, it restored from a very old checkpoint. As a result,
> Kafka offsets are invalid and caused the job to replay from the beginning as
> Kafka consumer "auto.offset.reset" was set to "EARLIEST".
> This is an embarrassingly parallel stateless job. Parallelism is over 1,000.
> I have the full log file from jobmanager at INFO level available upon request.
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> [jobmanager-future-thread-5] - Completed checkpoint 210768 for job
> 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During restart, somehow it decided to restore from a very old checkpoint
> *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-5] - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
> [cluster-io-thread-25] - Deleting path:
> s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-5] - Trying to recover job with job id
> 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
> [cluster-io-thread-25] - Deleting path:
> s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints
> from storage.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> [flink-akka.actor.default-dispatcher-82003] - Starting job
> 233b4938179c06974e4535ac8a868675 from savepoint
> s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
> ()
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)