[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789390#comment-17789390 ]

Yue Ma commented on FLINK-27681:
--------------------------------

[~fanrui]
{quote}But I still don't know why the file is corrupted, would you mind 
describing it in detail?
{quote}
In our production environment, most corrupted files are caused by hardware failures 
on the machine where the file was written (such as memory CE or SSD hardware 
failures). Under the default RocksDB options, once a corrupted SST file has been 
created, the DB keeps running normally as long as no compaction or Get/Iterator 
touches that file. But when the task fails and recovers from a checkpoint, some Get 
request or compaction may then read this file, and the task fails at that point.
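
For what it's worth, here is a rough sketch of how RocksDB's own checks could be 
tightened through Flink's RocksDBOptionsFactory, assuming a Flink version that 
exposes EmbeddedRocksDBStateBackend#setRocksDBOptions and a RocksJava build with 
setParanoidFileChecks. This only makes RocksDB detect a bad SST earlier; it does not 
prevent the hardware corruption itself:
{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class ParanoidChecksOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Verify checksums on data RocksDB reads internally (e.g. during compaction).
        return currentOptions.setParanoidChecks(true);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Re-read and verify every newly written SST file after flush/compaction,
        // so a file damaged at write time fails immediately instead of at restore.
        return currentOptions.setParanoidFileChecks(true);
    }

    public static EmbeddedRocksDBStateBackend backendWithParanoidChecks() {
        EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true); // incremental checkpoints
        backend.setRocksDBOptions(new ParanoidChecksOptionsFactory());
        return backend;
    }
}
{code}
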
{quote} Is it possible that file corruption occurs after flink check but before 
uploading the file to hdfs?
{quote}
Strictly speaking, I think file corruption could also occur while uploading to the 
DFS or downloading back to local disk, so it might be better if Flink added a file 
verification mechanism to the checkpoint upload and download paths. But as far as I 
know, most DFSs already have their own data verification mechanisms, and at least we 
have not encountered this case in our production environment; almost all of the 
corruption we see happens before the file is uploaded to HDFS.
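
As an illustration of what such a verification mechanism could look like, here is a 
minimal sketch using only JDK classes; the helper and its call sites are 
hypothetical, Flink has no such hook today. The idea is to record a CRC32 of the 
local SST file before upload and re-check it after download on restore:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public final class SstChecksumUtil {

    /** Computes a CRC32 over the whole local file, e.g. right before uploading it to the DFS. */
    public static long crc32Of(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    /** Re-computes the checksum (e.g. after downloading the file on restore) and fails fast on mismatch. */
    public static void verify(Path file, long expectedCrc) throws IOException {
        long actual = crc32Of(file);
        if (actual != expectedCrc) {
            throw new IOException(
                    "Checksum mismatch for " + file + ": expected " + expectedCrc + " but was " + actual);
        }
    }
}
{code}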

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or 
> block verification fails when a job is restored. The cause is usually a problem 
> on the machine where the task runs, which results in incorrect files being 
> uploaded to HDFS, but by the time we notice the problem a long time has already 
> passed (a dozen minutes to half an hour). I'm not sure if anyone else has had a 
> similar problem.
> Since such a file is referenced by incremental checkpoints for a long time, once 
> the maximum number of retained checkpoints is exceeded we have to keep using this 
> file until it is no longer referenced. When the job fails, it cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and detect the 
> problem in time? (A rough sketch of what this could look like follows after this 
> quoted description.)
> 2. Can Flink automatically roll back to the previous checkpoint when the 
> checkpoint data is corrupted? Even with manual intervention, all we can do today 
> is try to recover from an existing checkpoint or discard the entire state.
> 3. Can we cap the number of references to a file based on the maximum number of 
> retained checkpoints? When the number of references exceeds the maximum number of 
> retained checkpoints - 1, the Task side would be required to upload a new file 
> for this reference. We are not sure whether this approach guarantees that the 
> newly uploaded file is correct.
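
A rough sketch of what the periodic check in point 1 above could look like, assuming 
the RocksJava version bundled with Flink exposes SstFileReader#verifyChecksum; the 
directory scan is simplified and ignores races with compactions deleting files while 
it runs:
{code:java}
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.rocksdb.Options;
import org.rocksdb.SstFileReader;

public class SstBackgroundVerifier {

    /** Verifies the block checksums of every *.sst file under the given RocksDB data directory. */
    public static void verifyAll(String dbPath) throws Exception {
        try (Options options = new Options();
                DirectoryStream<Path> ssts = Files.newDirectoryStream(Paths.get(dbPath), "*.sst")) {
            for (Path sst : ssts) {
                try (SstFileReader reader = new SstFileReader(options)) {
                    reader.open(sst.toString());
                    reader.verifyChecksum(); // throws RocksDBException if any block is corrupted
                }
            }
        }
    }
}
{code}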



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
