Unfortunately I do not have the bandwidth to do a detailed review, but a few things come to mind after a quick read:
- While it might be tactically beneficial to align with the existing implementation, a clean design which does not tie into the existing shuffle implementation would be preferable (if it can be done without over-engineering). The shuffle implementation can change, and there are custom implementations and experiments which differ quite a bit from what comes with Apache Spark.
- Please keep speculative execution in mind while designing the interfaces: in Spark, implicitly due to task scheduler logic, you won't have conflicts at an executor for the (shuffleId, mapId) and (shuffleId, mapId, reducerId) tuples. When you externalize it, there can be conflicts: passing a way to distinguish different task attempts for the same partition would be necessary for nontrivial implementations.

This would be a welcome and much-needed enhancement to Spark; looking forward to its progress!

Regards,
Mridul

On Wed, May 8, 2019 at 11:24 AM Yifei Huang (PD) <yif...@palantir.com> wrote:
> Hi everyone,
>
> For the past several months, we have been working on an API for pluggable
> storage of shuffle data. In this SPIP, we describe the proposed API, its
> implications, and how it fits into other work being done in the Spark
> shuffle space. If you're interested in Spark shuffle, and especially if you
> have done some work in this area already, please take a look at the SPIP
> and give us your thoughts and feedback.
>
> Jira Ticket: https://issues.apache.org/jira/browse/SPARK-25299
> SPIP:
> https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit
>
> Thank you!
>
> Yifei Huang and Matt Cheah
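[Editor's note: the speculative-execution concern above can be illustrated with a minimal sketch. This is not the SPIP's actual API; the names `ShuffleOutputTracker`, `MapOutputKey`, and their methods are hypothetical. The sketch assumes an in-memory store and a first-commit-wins policy: each task attempt writes under its own (shuffleId, mapId, attemptId) key so speculative duplicates never clobber each other, and a commit step resolves (shuffleId, mapId) to exactly one attempt.]

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key: attemptId distinguishes speculative attempts for the
// same map partition, which the Spark task scheduler otherwise hides.
final class MapOutputKey {
    final int shuffleId;
    final int mapId;
    final long attemptId;

    MapOutputKey(int shuffleId, int mapId, long attemptId) {
        this.shuffleId = shuffleId;
        this.mapId = mapId;
        this.attemptId = attemptId;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof MapOutputKey)) return false;
        MapOutputKey k = (MapOutputKey) o;
        return shuffleId == k.shuffleId && mapId == k.mapId && attemptId == k.attemptId;
    }

    @Override public int hashCode() {
        return Objects.hash(shuffleId, mapId, attemptId);
    }
}

// Hypothetical tracker: writes are isolated per attempt; commit picks a
// single winner per (shuffleId, mapId); reads only see committed data.
final class ShuffleOutputTracker {
    private final Map<MapOutputKey, byte[]> blocks = new ConcurrentHashMap<>();
    // (shuffleId, mapId) packed into a long -> winning attemptId
    private final Map<Long, Long> committed = new ConcurrentHashMap<>();

    private static long pack(int shuffleId, int mapId) {
        return ((long) shuffleId << 32) | (mapId & 0xffffffffL);
    }

    void write(int shuffleId, int mapId, long attemptId, byte[] data) {
        blocks.put(new MapOutputKey(shuffleId, mapId, attemptId), data);
    }

    // First attempt to commit wins; later speculative attempts are rejected.
    boolean commit(int shuffleId, int mapId, long attemptId) {
        return committed.putIfAbsent(pack(shuffleId, mapId), attemptId) == null;
    }

    byte[] read(int shuffleId, int mapId) {
        Long winner = committed.get(pack(shuffleId, mapId));
        return winner == null ? null : blocks.get(new MapOutputKey(shuffleId, mapId, winner));
    }
}
```

Without the attemptId in the key, the second (speculative) write would silently overwrite the first, and a reader could observe a partially written or mismatched output.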