Unfortunately I do not have bandwidth to do a detailed review, but a few
things come to mind after a quick read:

- While it might be tactically beneficial to align with the existing
implementation, a clean design that does not tie into the existing shuffle
implementation would be preferable (if that can be done without
over-engineering). The shuffle implementation can change, and there are
custom implementations and experiments that differ quite a bit from what
ships with Apache Spark.


- Please keep speculative execution in mind while designing the interfaces:
in Spark, implicitly due to task scheduler logic, you won’t have conflicts
at an executor for the (shuffleId, mapId) and (shuffleId, mapId, reducerId)
tuples.
When you externalize shuffle storage, there can be conflicts: passing a way
to distinguish different task attempts for the same partition would be
necessary for nontrivial implementations.
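
To make the concern concrete, here is a minimal Java sketch (hypothetical
names, not Spark's actual API): if an external shuffle store keys data only
by (shuffleId, mapId), a speculative copy of a map task silently overwrites
the original attempt's output; including the task attempt number in the key
keeps the two attempts distinct.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: a shuffle-block key that includes the attempt number
// of the task that wrote the data, so speculative attempts don't collide.
public class ShuffleKeyDemo {
    static final class ShuffleBlockKey {
        final int shuffleId;
        final int mapId;
        final int attemptNumber;

        ShuffleBlockKey(int shuffleId, int mapId, int attemptNumber) {
            this.shuffleId = shuffleId;
            this.mapId = mapId;
            this.attemptNumber = attemptNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ShuffleBlockKey)) return false;
            ShuffleBlockKey k = (ShuffleBlockKey) o;
            return shuffleId == k.shuffleId
                && mapId == k.mapId
                && attemptNumber == k.attemptNumber;
        }

        @Override
        public int hashCode() {
            return Objects.hash(shuffleId, mapId, attemptNumber);
        }
    }

    public static void main(String[] args) {
        Map<ShuffleBlockKey, String> store = new HashMap<>();
        // Original task and its speculative copy write the same
        // (shuffleId=0, mapId=7). With the attempt number in the key,
        // both writes land in distinct slots instead of one
        // overwriting the other.
        store.put(new ShuffleBlockKey(0, 7, 0), "output-of-attempt-0");
        store.put(new ShuffleBlockKey(0, 7, 1), "output-of-attempt-1");
        System.out.println(store.size()); // prints 2
    }
}
```

The same idea applies on the read side: the reducer (or the storage layer)
must be able to tell which attempt's output to treat as authoritative.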


This would be a welcome and much-needed enhancement to Spark. Looking
forward to its progress!


Regards,
Mridul



On Wed, May 8, 2019 at 11:24 AM Yifei Huang (PD) <yif...@palantir.com>
wrote:

> Hi everyone,
>
> For the past several months, we have been working on an API for pluggable
> storage of shuffle data. In this SPIP, we describe the proposed API, its
> implications, and how it fits into other work being done in the Spark
> shuffle space. If you're interested in Spark shuffle, and especially if you
> have done some work in this area already, please take a look at the SPIP
> and give us your thoughts and feedback.
>
> Jira Ticket: https://issues.apache.org/jira/browse/SPARK-25299
> SPIP:
> https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit
>
> Thank you!
>
> Yifei Huang and Matt Cheah
>
>
>
