Jinzhong Li created FLINK-34982:
-----------------------------------
Summary: FLIP-428: Fault Tolerance/Rescale Integration for
Disaggregated State
Key: FLINK-34982
URL: https://issues.apache.org/jira/browse/FLINK-34982
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing, Runtime / State Backends
Reporter: Jinzhong Li
Fix For: 2.0.0
This is a sub-FLIP of the disaggregated state management effort and its related
work. Please read [FLIP-423|https://cwiki.apache.org/confluence/x/R4p3EQ] first
to get the whole picture.
As outlined in [FLIP-423|https://cwiki.apache.org/confluence/x/R4p3EQ] [1] and
[FLIP-427|https://cwiki.apache.org/confluence/x/T4p3EQ] [2], we proposed
disaggregating state management and introduced a disaggregated state store named
ForSt, which evolved from RocksDB. In the new framework, where the primary
storage resides on a remote file system, several challenges emerge when
attempting to reuse the existing fault-tolerance mechanisms of local RocksDB:
* Because most remote file systems don't support hard links, ForSt can't use
hard links to capture a consistent snapshot during the synchronous phase of a
checkpoint, as RocksDB currently does.
* The existing file transfer mechanism within RocksDB is inefficient during
checkpoints: it first downloads the remote working state data to local memory
and then uploads it to the checkpoint directory. Likewise, both restore and
rescale face similar problems due to superfluous data transmission.
In order to solve the above problems and improve checkpoint/restore/rescaling
performance of disaggregated storage, this FLIP proposes:
# A new checkpoint strategy for disaggregated state storage: leverage
RocksDB's low-level API to retain a consistent snapshot during the synchronous
phase of the checkpoint, and then transfer the snapshot files to the checkpoint
directory during the asynchronous phase;
# Accelerating checkpoint/restore/rescaling by leveraging the fast-duplication
capability of the remote file system, which can bypass the local TaskManager
when transferring data between the remote working directory and the checkpoint
directory.
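The two proposals above can be illustrated with a minimal Java sketch. It is not taken from the FLIP: the class DisaggregatedCheckpointSketch, the FastDuplicator interface and the methods retainSnapshot and transferSnapshot are hypothetical names, and local java.nio paths stand in for the remote working and checkpoint directories. Listing the directory is only a stand-in for asking the store which immutable files form the snapshot.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative sketch only; none of these types are Flink or ForSt APIs. */
public class DisaggregatedCheckpointSketch {

    /** Hypothetical file-system capability: a copy that never passes through the caller. */
    public interface FastDuplicator {
        boolean canFastDuplicate(Path source, Path target);

        void duplicate(Path source, Path target) throws IOException;
    }

    /**
     * Synchronous phase: record which files make up the snapshot. No data is copied.
     * Assumption: a directory listing stands in for the store's own file bookkeeping.
     */
    public static List<Path> retainSnapshot(Path workingDir) throws IOException {
        try (Stream<Path> files = Files.list(workingDir)) {
            return files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    /** Asynchronous phase: move the retained files into the checkpoint directory. */
    public static List<Path> transferSnapshot(
            List<Path> snapshot, Path checkpointDir, FastDuplicator duplicator)
            throws IOException {
        Files.createDirectories(checkpointDir);
        List<Path> transferred = new ArrayList<>();
        for (Path source : snapshot) {
            Path target = checkpointDir.resolve(source.getFileName());
            if (duplicator != null && duplicator.canFastDuplicate(source, target)) {
                // Preferred path: the file system duplicates the file itself.
                duplicator.duplicate(source, target);
            } else {
                // Fallback: bytes are read and written by this process.
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            }
            transferred.add(target);
        }
        return transferred;
    }

    public static void main(String[] args) throws IOException {
        Path workingDir = Files.createTempDirectory("working");
        Path checkpointDir = Files.createTempDirectory("checkpoints").resolve("chk-1");
        Files.write(workingDir.resolve("000001.sst"), "state".getBytes(StandardCharsets.UTF_8));

        List<Path> snapshot = retainSnapshot(workingDir);
        List<Path> result = transferSnapshot(snapshot, checkpointDir, null);
        System.out.println("Transferred " + result.size() + " file(s) to " + checkpointDir);
    }
}
```

Passing null as the duplicator exercises the fallback copy, which corresponds to the download-then-upload path described in the second bullet. A real remote file system would supply its own duplication primitive in place of the hypothetical FastDuplicator.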