mccheah opened a new pull request #25007: [SPARK-28209][CORE][SHUFFLE] Proposed 
new shuffle writer API 
URL: https://github.com/apache/spark/pull/25007
 
 
   ## What changes were proposed in this pull request?
   
   As part of the shuffle storage API proposed in SPARK-25299, this introduces 
an API for persisting shuffle data in arbitrary storage systems.
   
   This patch introduces several concepts:
   * `ShuffleDataIO`, which is the root of the entire plugin tree that will be 
proposed over the course of the shuffle API project.
   * `ShuffleExecutorComponents` - the subset of plugins for managing 
shuffle-related components for each executor. This will in turn instantiate 
shuffle readers and writers.
   * `ShuffleMapOutputWriter` interface - instantiated once per map task. This 
provides child `ShufflePartitionWriter` instances for persisting the bytes for 
each partition in the map task.
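
The relationship between the components listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the method names (`executor`, `createMapOutputWriter`, `getPartitionWriter`, `commitAllPartitions`, `openStream`) and their signatures are assumptions, and `InMemoryShuffleDataIO` is a hypothetical toy backend that only keeps bytes in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Driver for the sketch: walks the plugin tree from root to partition writer.
final class ShuffleApiSketch {
    static int[] demo() {
        InMemoryShuffleDataIO io = new InMemoryShuffleDataIO();
        ShuffleExecutorComponents executor = io.executor();
        try {
            // One map output writer per map task (here: shuffle 0, map 0, 2 partitions).
            ShuffleMapOutputWriter mapWriter = executor.createMapOutputWriter(0, 0, 2);
            // One partition writer per partition; only partition 1 receives bytes.
            try (OutputStream out = mapWriter.getPartitionWriter(1).openStream()) {
                out.write(new byte[] {1, 2, 3});
            }
            mapWriter.commitAllPartitions();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return io.committedLengths;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}

// Assumed, simplified interface shapes; the real signatures may differ.
interface ShuffleDataIO {
    ShuffleExecutorComponents executor();
}

interface ShuffleExecutorComponents {
    ShuffleMapOutputWriter createMapOutputWriter(int shuffleId, int mapId, int numPartitions)
        throws IOException;
}

interface ShuffleMapOutputWriter {
    ShufflePartitionWriter getPartitionWriter(int partitionId) throws IOException;
    void commitAllPartitions() throws IOException;
}

interface ShufflePartitionWriter {
    OutputStream openStream() throws IOException;
}

// Hypothetical backend: buffers each partition in memory instead of real storage.
final class InMemoryShuffleDataIO implements ShuffleDataIO {
    int[] committedLengths = new int[0];

    @Override
    public ShuffleExecutorComponents executor() {
        return (shuffleId, mapId, numPartitions) -> new InMemoryMapOutputWriter(this, numPartitions);
    }
}

final class InMemoryMapOutputWriter implements ShuffleMapOutputWriter {
    private final InMemoryShuffleDataIO owner;
    private final ByteArrayOutputStream[] buffers;

    InMemoryMapOutputWriter(InMemoryShuffleDataIO owner, int numPartitions) {
        this.owner = owner;
        this.buffers = new ByteArrayOutputStream[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            buffers[i] = new ByteArrayOutputStream();
        }
    }

    @Override
    public ShufflePartitionWriter getPartitionWriter(int partitionId) {
        return () -> buffers[partitionId];
    }

    @Override
    public void commitAllPartitions() {
        // Record how many bytes each partition received.
        int[] lengths = new int[buffers.length];
        for (int i = 0; i < buffers.length; i++) {
            lengths[i] = buffers[i].size();
        }
        owner.committedLengths = lengths;
    }
}
```

Running `ShuffleApiSketch.main` prints the per-partition byte counts recorded at commit time. A different storage system would presumably supply its own `ShuffleDataIO` root and leave the calling code unchanged.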
   
   The default implementations of these plugins exactly mirror the existing 
shuffle writing code: they write the data to local disk and write an index 
file. This patch uses the new APIs only in `BypassMergeSortShuffleWriter`. 
Adopting them in `SortShuffleWriter` and `UnsafeShuffleWriter` is left to 
follow-up PRs to minimize the review surface area.
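
For readers unfamiliar with the data-plus-index layout mentioned above, here is a rough sketch of the idea. It assumes the index file holds `numPartitions + 1` cumulative byte offsets stored as 8-byte longs; the class name, method names, and file names are hypothetical and are not taken from this patch.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LocalDiskLayoutSketch {
    // Writes all partitions back to back into dataFile, then writes the
    // cumulative offsets (starting at 0) into indexFile.
    static long[] write(Path dataFile, Path indexFile, byte[][] partitions) throws IOException {
        long[] offsets = new long[partitions.length + 1];
        try (OutputStream data = new BufferedOutputStream(Files.newOutputStream(dataFile))) {
            for (int i = 0; i < partitions.length; i++) {
                data.write(partitions[i]);
                offsets[i + 1] = offsets[i] + partitions[i].length;
            }
        }
        try (DataOutputStream index =
                 new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(indexFile)))) {
            for (long offset : offsets) {
                index.writeLong(offset);
            }
        }
        return offsets;
    }

    // Looks up the [start, end) byte range of one partition in the index,
    // then reads exactly that range from the data file.
    static byte[] readPartition(Path dataFile, Path indexFile, int partitionId) throws IOException {
        long start;
        long end;
        try (RandomAccessFile index = new RandomAccessFile(indexFile.toFile(), "r")) {
            index.seek((long) partitionId * Long.BYTES);
            start = index.readLong();
            end = index.readLong();
        }
        byte[] out = new byte[(int) (end - start)];
        try (RandomAccessFile data = new RandomAccessFile(dataFile.toFile(), "r")) {
            data.seek(start);
            data.readFully(out);
        }
        return out;
    }

    // Round trip: write three partitions, read back the middle one.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("shuffle-sketch");
            Path data = dir.resolve("shuffle_0_0.data");   // hypothetical file name
            Path index = dir.resolve("shuffle_0_0.index"); // hypothetical file name
            byte[][] partitions = {
                "a".getBytes(StandardCharsets.UTF_8),
                "bb".getBytes(StandardCharsets.UTF_8),
                "ccc".getBytes(StandardCharsets.UTF_8)
            };
            write(data, index, partitions);
            return new String(readPartition(data, index, 1), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this layout a reader needs only two adjacent index entries to locate a partition, so it can fetch a single partition without scanning the data file.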
   
   ## How was this patch tested?
   
   New unit tests were added. Micro-benchmarks indicate there's no slowdown in 
the affected code paths.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
