[
https://issues.apache.org/jira/browse/SPARK-55609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060915#comment-18060915
]
Shekhar Prasad Rajak commented on SPARK-55609:
----------------------------------------------
How can I assign this ticket to myself.
> SPIP: Arrow Flight-Based Shuffle for Apache Spark
> --------------------------------------------------
>
> Key: SPARK-55609
> URL: https://issues.apache.org/jira/browse/SPARK-55609
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager, Shuffle, Spark Core
> Affects Versions: 4.1.1
> Reporter: Shekhar Prasad Rajak
> Priority: Major
>
> This proposal introduces Apache Arrow Flight as an alternative shuffle
> transport mechanism for Apache Spark. The current *Netty-based block transfer
> would be replaced with Arrow Flight's gRPC streaming protocol, enabling
> zero-copy* network transfer of columnar data, true bidirectional streaming,
> native columnar shuffle using Apache Arrow format, and projected 2-5x
> improvement in shuffle-heavy workloads.
>
> The existing Spark shuffle implementation has several performance
> {*}bottlenecks{*}:
> Problem 1 - *Row-based serialization:* Spark uses UnsafeRow format, requiring
> per-row serialization and deserialization overhead.
> Problem 2 - {*}Multiple memory copies{*}: Data is copied 6-8 times between
> mapper and reducer due to serialization, disk I/O, and network transfer
> boundaries.
> Problem 3 - \{*}Request-response model{*}: The current Netty-based transport
> uses a request-response pattern rather than true streaming, introducing
> latency for large data transfers.
> Problem 4 - *CPU overhead:* Java serialization consumes significant CPU
> cycles, especially for high-throughput workloads.
>
> *+The proposed architecture+* replaces the Netty-based block transfer with
> Arrow Flight gRPC streaming. Each executor runs a Flight server that serves
> shuffle data directly to requesting reducers.
>
> Design and approach discussion:
> [https://docs.google.com/document/d/14Gki2k-Lkr7P5n8tAHGi44XYDpr-YomRFFh-kFh3-54/edit?usp=sharing]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]