Shekhar Prasad Rajak created SPARK-55609:
--------------------------------------------

             Summary:  Arrow Flight-Based Shuffle for Apache Spark
                 Key: SPARK-55609
                 URL: https://issues.apache.org/jira/browse/SPARK-55609
             Project: Spark
          Issue Type: New Feature
          Components: Block Manager, Shuffle, Spark Core
    Affects Versions: 4.1.1
            Reporter: Shekhar Prasad Rajak


This proposal introduces Apache Arrow Flight as an alternative shuffle 
transport mechanism for Apache Spark. The current *Netty-based block transfer 
would be replaced with Arrow Flight's gRPC streaming protocol, enabling 
zero-copy* network transfer of columnar data, true bidirectional streaming, 
native columnar shuffle using Apache Arrow format, and projected 2-5x 
improvement in shuffle-heavy workloads.

 

The existing Spark shuffle implementation has several performance 
{*}bottlenecks{*}:

Problem 1 - *Row-based serialization:* Spark uses UnsafeRow format, requiring 
per-row serialization and deserialization overhead.

Problem 2 - {*}Multiple memory copies{*}: Data is copied 6-8 times between 
mapper and reducer due to serialization, disk I/O, and network transfer 
boundaries.

Problem 3 - \{*}Request-response model{*}: The current Netty-based transport 
uses a request-response pattern rather than true streaming, introducing latency 
for large data transfers.

Problem 4 - *CPU overhead:* Java serialization consumes significant CPU cycles, 
especially for high-throughput workloads.

 

*+The proposed architecture+* replaces the Netty-based block transfer with 
Arrow Flight gRPC streaming. Each executor runs a Flight server that serves 
shuffle data directly to requesting reducers.

 

Design and approach discussion: 
[https://docs.google.com/document/d/14Gki2k-Lkr7P5n8tAHGi44XYDpr-YomRFFh-kFh3-54/edit?usp=sharing]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to