[
https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759475#comment-17759475
]
Hangxiang Yu commented on FLINK-27681:
--------------------------------------
[~mayuehappy] Do you mean calling db.VerifyChecksum() in the async thread of
checkpoint ?
I just rethinked all interfaces rocksdb provided, this may also bring too much
cost which may result in unavailable checkpoint when enabling this option.
{code:java}
the API call may take a significant amount of time to finish{code}
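For reference, a rough sketch of what such an async verification could look like from the Java side. The class name, the executor, and the assumption that the bundled Java binding exposes RocksDB#verifyChecksum() are illustrative, not Flink's actual snapshot code:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative sketch only: run the whole-DB checksum verification off the task thread.
// "db" and "asyncExecutor" are assumed to be provided by the snapshot context.
public final class ChecksumVerification {
    public static CompletableFuture<Void> verifyAsync(RocksDB db, ExecutorService asyncExecutor) {
        return CompletableFuture.runAsync(() -> {
            try {
                // Verifies the checksums of all live SST files; the cost grows with the
                // total state size, which is why the checkpoint may become too slow.
                db.verifyChecksum();
            } catch (RocksDBException e) {
                throw new RuntimeException("RocksDB checksum verification failed", e);
            }
        }, asyncExecutor);
    }
}
{code}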
I think the best way is to verify only the checksums of the incremental SST files to
reduce the cost, but it seems RocksDB doesn't provide an interface to verify at the
SST level.
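If a per-file check were possible, it might look like the sketch below. This uses the standalone org.rocksdb.SstFileReader and its verifyChecksum(), which is an assumption about the bundled Java binding and may not be equivalent to verifying the file inside the live DB, so it does not necessarily close the gap described above:
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileReader;

// Hypothetical helper: verify one incremental SST file on its own, outside the live DB.
// Whether this gives the same guarantees as db.VerifyChecksum() is an open question.
public final class SstFileVerification {
    public static void verify(String sstFilePath) throws RocksDBException {
        try (Options options = new Options();
             SstFileReader reader = new SstFileReader(options)) {
            reader.open(sstFilePath);   // open the single SST file
            reader.verifyChecksum();    // check block checksums of this file only
        }
    }
}
{code}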
> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
> Key: FLINK-27681
> URL: https://issues.apache.org/jira/browse/FLINK-27681
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / State Backends
> Reporter: Ming Li
> Priority: Critical
> Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or
> block verification fails when the job is restored. The reason is generally that
> there is some problem with the machine where the task runs, which causes the
> files uploaded to HDFS to be incorrect, but by the time we discovered the
> problem a long time had passed (a dozen minutes to half an hour). I'm not sure
> if anyone else has had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time,
> once the maximum number of retained checkpoints is exceeded, we can only keep
> using this file until it is no longer referenced. When the job fails, it cannot
> be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and detect the
> problem in time?
> 2. Can Flink automatically roll back to a previous checkpoint when there is a
> problem with the checkpoint data? Even with manual intervention, one can only
> try to recover from an existing checkpoint or discard the entire state.
> 3. Can we cap the maximum number of references to a file based on the maximum
> number of retained checkpoints? When the number of references exceeds the
> maximum number of retained checkpoints minus 1, the task side would be required
> to upload a new file for this reference. We are not sure whether this would
> guarantee that the newly uploaded file is correct.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)