Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/5403#issuecomment-97151475
My opinion is that the main criteria for including this are:
* Is there an intention for and clear path to a state where this could be
used in production? I think it's likely that this means adding spill
functionality so we never OOM.
* Is there somebody interested in actively maintaining this? There's
already a pretty severe dearth of knowledge about shuffle workings. If we want
to scale its complexity, we need to scale the number of contributors that
understand it.
> On Apr 28, 2015, at 10:32 AM, Kay Ousterhout <[email protected]> wrote:
>
> It seems like there are two separate issues here:
>
> (1) Should Spark ever have an in-memory shuffle? Personally I think we
> should, partially because it's useful for benchmarking and partially because
> there are some environments (as @mikeringenburg pointed out) where it makes
> more sense to store shuffle data in memory (for performance reasons, cluster
> provisioning reasons, etc.). However, @rxin, it sounds like you're pretty
> strongly against this for maintainability reasons; if you're going to block all
> attempts at doing this, we should just close SPARK-3376 as "Will not fix".
>
> (2) If yes to the above question, should we add this particular in-memory
> shuffle? To list a few reasons why we might not want this implementation:
> - In its current form, this implementation doesn't clean up in-memory
> shuffle files any more aggressively than normal shuffle files are cleaned up
> (so the shuffle data won't be deleted until the associated RDD goes out of
> scope). In-memory shuffle data should really be cleaned up more aggressively,
> because unlike when we store shuffle data on disk, there's a high cost of
> keeping the data around.
> - In addition to doing better cleanup of shuffle data, we likely would
> want to store shuffle data as a separate storage level (or with some kind of
> tag) so we can more cleanly fail when shuffle data becomes too large (i.e.,
> explicitly fail with an "out of memory for shuffle" kind of exception, rather
> than a generic OOM).
>
> Parts of these issues are small and could just be fixed as part of this
> PR, while others are more substantial. @sryza and @pwendell, it would help if
> you two could describe what you'd like to see in an ideal version of this
> feature, to understand whether they're things that can just be fixed as part of
> this PR.