[ https://issues.apache.org/jira/browse/FLINK-34984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuan Mei updated FLINK-34984:
-----------------------------
Description:
The past decade has witnessed a dramatic shift in Flink's deployment modes,
workload patterns, and underlying hardware. We've moved from the map-reduce
era, in which workers were nodes with tightly coupled computation and storage,
to a cloud-native world in which containerized deployments on Kubernetes have
become standard. To enable Flink's cloud-native future, we introduce
Disaggregated State Storage and Management, which uses DFS as primary storage,
in Flink 2.0.
Design details can be found in
[FLIP-423|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855].
This new architecture aims to solve the following challenges that the
cloud-native era brings to Flink:
1. Local Disk Constraints in containerization
2. Spiky Resource Usage caused by compaction in the current state model
3. Fast Rescaling for jobs with large states (hundreds of Terabytes)
4. Light and Fast Checkpoint in a native way
was:
The past decade has witnessed a dramatic shift in Flink's deployment modes,
workload patterns, and underlying hardware. We've moved from the map-reduce
era, in which workers were nodes with tightly coupled computation and storage,
to a cloud-native world in which containerized deployments on Kubernetes have
become standard. To enable Flink's cloud-native future, we introduce
Disaggregated State Storage and Management, which uses DFS as primary storage,
in Flink 2.0, as promised in the Flink 2.0 Roadmap.
Detailed design and story:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855
Also sub-FLIPs:
- Asynchronous State APIs
([FLIP-424|https://cwiki.apache.org/confluence/x/SYp3EQ]): Introduce new APIs
for asynchronous state access.
- Asynchronous Execution Model
([FLIP-425|https://cwiki.apache.org/confluence/x/S4p3EQ]): Implement a
non-blocking execution model leveraging the asynchronous APIs introduced in
FLIP-424.
- Grouping Remote State Access
([FLIP-426|https://cwiki.apache.org/confluence/x/TYp3EQ]): Enable retrieval of
remote state data in batches to avoid unnecessary round-trip costs for remote
access.
- Disaggregated State Store
([FLIP-427|https://cwiki.apache.org/confluence/x/T4p3EQ]): Introduce the
initial version of the ForSt disaggregated state store.
- Fault Tolerance/Rescale Integration
([FLIP-428|https://cwiki.apache.org/confluence/x/UYp3EQ]): Integrate
checkpointing mechanisms with the disaggregated state store for fault tolerance
and fast rescaling.
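As a rough illustration of how asynchronous state access and grouped remote
reads (the ideas behind FLIP-424 and FLIP-426) can fit together, here is a
minimal sketch. It is not taken from the FLIPs: RemoteStore, BatchingReader,
asyncGet, multiGet and flush are hypothetical names invented for this example
and are not Flink or ForSt APIs. The sketch assumes a store that can serve
several keys in one request, and it counts simulated round trips instead of
doing real I/O.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch only; none of these types are Flink or ForSt APIs. */
public class BatchedStateReads {

    /** Stand-in for a remote state store; counts simulated round trips. */
    static final class RemoteStore {
        private final Map<String, Long> data = new HashMap<>();
        int roundTrips = 0;

        void put(String key, long value) {
            data.put(key, value);
        }

        /** One simulated round trip that serves every key in the batch. */
        Map<String, Long> multiGet(List<String> keys) {
            roundTrips++;
            Map<String, Long> result = new HashMap<>();
            for (String key : keys) {
                result.put(key, data.get(key)); // null if the key is absent
            }
            return result;
        }
    }

    /** Buffers asynchronous reads and resolves them with one multiGet per flush. */
    static final class BatchingReader {
        private final RemoteStore store;
        private final List<String> pendingKeys = new ArrayList<>();
        private final List<CompletableFuture<Long>> pendingFutures = new ArrayList<>();

        BatchingReader(RemoteStore store) {
            this.store = store;
        }

        /** Returns immediately; the future completes on the next flush. */
        CompletableFuture<Long> asyncGet(String key) {
            CompletableFuture<Long> future = new CompletableFuture<>();
            pendingKeys.add(key);
            pendingFutures.add(future);
            return future;
        }

        /** Sends all buffered keys in a single request and completes the futures. */
        void flush() {
            if (pendingKeys.isEmpty()) {
                return;
            }
            Map<String, Long> values = store.multiGet(pendingKeys);
            for (int i = 0; i < pendingKeys.size(); i++) {
                pendingFutures.get(i).complete(values.get(pendingKeys.get(i)));
            }
            pendingKeys.clear();
            pendingFutures.clear();
        }
    }

    public static void main(String[] args) {
        RemoteStore store = new RemoteStore();
        store.put("user-1", 10L);
        store.put("user-2", 20L);
        store.put("user-3", 30L);

        BatchingReader reader = new BatchingReader(store);
        // The callbacks run once the grouped read has completed.
        reader.asyncGet("user-1").thenAccept(v -> System.out.println("user-1 = " + v));
        reader.asyncGet("user-2").thenAccept(v -> System.out.println("user-2 = " + v));
        reader.asyncGet("user-3").thenAccept(v -> System.out.println("user-3 = " + v));

        reader.flush();
        System.out.println("round trips = " + store.roundTrips);
    }
}
```

In this sketch the caller never blocks on an individual read, and the three
lookups share one simulated request. A real implementation would also have to
decide when to flush and how to preserve per-key ordering, which this example
does not attempt.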
> FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)
> --------------------------------------------------------------------
>
> Key: FLINK-34984
> URL: https://issues.apache.org/jira/browse/FLINK-34984
> Project: Flink
> Issue Type: New Feature
> Components: API / Core, API / DataStream, Runtime / Checkpointing,
> Runtime / State Backends
> Reporter: Yuan Mei
> Priority: Major