Jin Xing created FLINK-22672:
--------------------------------

             Summary: Some enhancements for pluggable shuffle service framework
                 Key: FLINK-22672
                 URL: https://issues.apache.org/jira/browse/FLINK-22672
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Network
            Reporter: Jin Xing


"Pluggable shuffle service" in Flink provides an architecture that is unified 
for both streaming and batch jobs, allowing users to customize how data is 
transferred between shuffle stages according to their scenarios.

There are already a number of implementations of "remote shuffle service" on 
Spark, such as [1][2][3]. Remote shuffle makes it possible to shuffle data 
from/to a remote cluster and brings benefits such as:
 # The lifecycle of computing resources can be decoupled from that of shuffle 
data. Once a computing task has finished, idle computing nodes can be released 
while their completed shuffle data is accommodated on the remote shuffle cluster.
 # There is no need to reserve disk capacity for shuffle on computing nodes. 
The remote shuffle cluster serves shuffle requests with better scalability and 
alleviates local disk pressure on computing nodes when data is skewed.

Based on "pluggable shuffle service", we built our own "remote shuffle service" 
on Flink, called Lattice, which aims to provide functionality and improve 
performance for batch processing jobs. Basically it works as follows:
 # The Lattice cluster works as an independent service that serves shuffle 
requests;
 # LatticeShuffleMaster extends ShuffleMaster, works inside the JM and talks to 
the remote Lattice cluster to apply for shuffle resources and to manage the 
lifecycle of shuffle data;
 # LatticeShuffleEnvironment extends ShuffleEnvironment, works inside the TM and 
provides an environment for shuffling data from/to the remote Lattice cluster.
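
As a rough sketch of how such a plugin is wired in: Flink selects the shuffle 
implementation through the "shuffle-service-factory.class" option, and the 
configured ShuffleServiceFactory creates the ShuffleMaster on the JM and the 
ShuffleEnvironment on the TM. The factory class name below is hypothetical and 
only illustrates the shape of the configuration; it is not taken from Lattice.

```yaml
# flink-conf.yaml
# Assumption: the class name is a hypothetical placeholder for a factory that
# would create LatticeShuffleMaster and LatticeShuffleEnvironment.
shuffle-service-factory.class: com.example.lattice.LatticeShuffleServiceFactory
```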

While building Lattice, we found some potential enhancements to the 
"pluggable shuffle service". I will enumerate them and create sub-JIRAs under 
this umbrella issue.

 

[1] [https://www.alibabacloud.com/blog/emr-remote-shuffle-service-a-powerful-elastic-tool-of-serverless-spark_597728]

[2] [https://bestoreo.github.io/post/cosco/cosco/]

[3] [https://github.com/uber/RemoteShuffleService]


