[jira] [Updated] (FLINK-34984) FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)

Yuan Mei (Jira) Mon, 01 Apr 2024 04:56:05 -0700


     [ https://issues.apache.org/jira/browse/FLINK-34984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yuan Mei updated FLINK-34984:
-----------------------------
    Description: 
Over the past decade, Flink's deployment modes, workload patterns, and 
underlying hardware have changed dramatically. We've moved from the map-reduce 
era, where workers were nodes with tightly coupled computation and storage, to 
a cloud-native world where containerized deployments on Kubernetes have become 
standard. To enable Flink's cloud-native future, we introduce Disaggregated 
State Storage and Management, which uses DFS as primary storage, in Flink 2.0.

 

Design details can be found in 
[FLIP-423|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855].

This new architecture aims to solve the following challenges that the 
cloud-native era brings to Flink.
1. Local Disk Constraints in containerization
2. Spiky Resource Usage caused by compaction in the current state model
3. Fast Rescaling for jobs with large states (hundreds of Terabytes)
4. Light and Fast Checkpoint in a native way

  was:
Over the past decade, Flink's deployment modes, workload patterns, and 
underlying hardware have changed dramatically. We've moved from the map-reduce 
era, where workers were nodes with tightly coupled computation and storage, to 
a cloud-native world where containerized deployments on Kubernetes have become 
standard. To enable Flink's cloud-native future, we introduce Disaggregated 
State Storage and Management, which uses DFS as primary storage, in Flink 2.0, 
as promised in the Flink 2.0 Roadmap.

Detailed design and story: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855

Also sub-FLIPs:
- Asynchronous State APIs 
([FLIP-424|https://cwiki.apache.org/confluence/x/SYp3EQ]): Introduce new APIs 
for asynchronous state access. 
- Asynchronous Execution Model 
([FLIP-425|https://cwiki.apache.org/confluence/x/S4p3EQ]): Implement a 
non-blocking execution model leveraging the asynchronous APIs introduced in 
FLIP-424. 
- Grouping Remote State Access 
([FLIP-426|https://cwiki.apache.org/confluence/x/TYp3EQ]): Enable retrieval of 
remote state data in batches to avoid unnecessary round-trip costs for remote 
access. 
- Disaggregated State Store 
([FLIP-427|https://cwiki.apache.org/confluence/x/T4p3EQ]): Introduce the 
initial version of the ForSt disaggregated state store.
- Fault Tolerance/Rescale Integration 
([FLIP-428|https://cwiki.apache.org/confluence/x/UYp3EQ]): Integrate 
checkpointing mechanisms with the disaggregated state store for fault tolerance 
and fast rescaling.


> FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)
> --------------------------------------------------------------------
>
>                 Key: FLINK-34984
>                 URL: https://issues.apache.org/jira/browse/FLINK-34984
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / Core, API / DataStream, Runtime / Checkpointing, 
> Runtime / State Backends
>            Reporter: Yuan Mei
>            Priority: Major
>
> Over the past decade, Flink's deployment modes, workload patterns, and 
> underlying hardware have changed dramatically. We've moved from the 
> map-reduce era, where workers were nodes with tightly coupled computation and 
> storage, to a cloud-native world where containerized deployments on 
> Kubernetes have become standard. To enable Flink's cloud-native future, we 
> introduce Disaggregated State Storage and Management, which uses DFS as 
> primary storage, in Flink 2.0.
>  
> Design details can be found in 
> [FLIP-423|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855].
> This new architecture aims to solve the following challenges that the 
> cloud-native era brings to Flink.
> 1. Local Disk Constraints in containerization
> 2. Spiky Resource Usage caused by compaction in the current state model
> 3. Fast Rescaling for jobs with large states (hundreds of Terabytes)
> 4. Light and Fast Checkpoint in a native way



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
