[
https://issues.apache.org/jira/browse/SPARK-56664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boyang Jerry Peng updated SPARK-56664:
--------------------------------------
Description:
The streaming shuffle is an alternative {{ShuffleManager}} implementation
designed for low-latency, continuously-running queries (for example, real-time
mode in Structured Streaming). Unlike the default sort-based shuffle, it does
not materialize map outputs to disk and does not require map tasks to finish
before reduce tasks can start. Instead, each map task hosts a network server
that pushes records to reduce tasks as they are produced; reduce tasks open
clients to those servers and consume records as a stream.
Design doc:
https://github.com/jerrypeng/spark/blob/1fe0abd72a317e8f73df1406966f8f49b24e8fd1/docs/streaming-shuffle.md
was:
The streaming shuffle is an alternative {{ShuffleManager}} implementation
designed for low-latency, continuously-running queries (for example, real-time
mode in Structured Streaming). Unlike the default sort-based shuffle, it does
not materialize map outputs to disk and does not require map tasks to finish
before reduce tasks can start. Instead, each map task hosts a network server
that pushes records to reduce tasks as they are produced; reduce tasks open
clients to those servers and consume records as a stream.
Design doc:
https://github.com/jerrypeng/spark/blob/ecb48211f821842caf717ebdf069da5505f43f68/docs/streaming-shuffle.md
> Add a new shuffle implementation to support Real-time Mode (RTM)
> ----------------------------------------------------------------
>
> Key: SPARK-56664
> URL: https://issues.apache.org/jira/browse/SPARK-56664
> Project: Spark
> Issue Type: Epic
> Components: Structured Streaming
> Affects Versions: 4.2.0
> Reporter: Boyang Jerry Peng
> Priority: Major
>
> The streaming shuffle is an alternative {{ShuffleManager}} implementation
> designed for low-latency, continuously-running queries (for example,
> real-time mode in Structured Streaming). Unlike the default sort-based
> shuffle, it does not materialize map outputs to disk and does not require map
> tasks to finish before reduce tasks can start. Instead, each map task hosts a
> network server that pushes records to reduce tasks as they are produced;
> reduce tasks open clients to those servers and consume records as a stream.
>
> Design doc:
>
> https://github.com/jerrypeng/spark/blob/1fe0abd72a317e8f73df1406966f8f49b24e8fd1/docs/streaming-shuffle.md
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]