Thanks for driving this effort, xiangyu! The proposal overall LGTM. I just have a small question. There are other places in Flink that interact with external storage. Should we consider adding a general retry mechanism to them?
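To make the question concrete: a general mechanism could be as small as a shared helper driven by the same two settings the FLIP proposes (retry times and retry interval). The sketch below is only an illustration under assumptions. `RetryingTransfer` and `callWithRetry` are hypothetical names, not existing Flink classes and not taken from the FLIP, and retrying only on `IOException` is an assumed policy.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Flink or FLIP-414. */
final class RetryingTransfer {

    private RetryingTransfer() {}

    /**
     * Runs the action once, then retries up to retryTimes more times if it
     * fails with an IOException, sleeping retryInterval between attempts.
     * retryTimes = 0 means the action is attempted exactly once (no retry).
     */
    static <T> T callWithRetry(Callable<T> action, int retryTimes, Duration retryInterval)
            throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        IOException lastFailure = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                // Assumed policy: only I/O failures are treated as transient.
                lastFailure = e;
                if (attempt < retryTimes) {
                    Thread.sleep(retryInterval.toMillis());
                }
            }
        }
        // Reached only after every attempt failed with an IOException.
        throw lastFailure;
    }
}
```

With `retryTimes = 0` the helper behaves like a plain call, which matches the FLIP's stated default of performing no retry. Other components that talk to external storage could wrap their upload or download calls in the same way.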
xiangyu feng <xiangyu...@gmail.com> wrote on Mon, Jan 8, 2024 at 11:31:

> Hi devs,
>
> I'm opening this thread to discuss FLIP-414: Support Retry Mechanism in
> RocksDBStateDataTransfer[1].
>
> Currently, there is no retry mechanism for downloading and uploading
> RocksDB state files. Any jitter in the remote filesystem might lead to a
> checkpoint failure. By supporting a retry mechanism in
> `RocksDBStateDataTransfer`, we can significantly reduce the checkpoint
> failure rate during the asynchronous phase.
>
> To make this retry mechanism configurable, we have introduced two options
> in this FLIP: `state.backend.rocksdb.checkpoint.transfer.retry.times` and
> `state.backend.rocksdb.checkpoint.transfer.retry.interval`. By default, no
> retry is performed, so the behavior stays consistent with the original.
>
> Looking forward to your feedback, thanks.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-414%3A+Support+Retry+Mechanism+in+RocksDBStateDataTransfer
>
> Best regards,
> Xiangyu Feng

--
Best,
Yue