[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

sryza Tue, 28 Apr 2015 10:50:37 -0700

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97151475
  
    My opinion is that the main criteria for including this are:
    * Is there an intention for and clear path to a state where this could be 
used in production? I think it's likely that this means adding spill 
functionality so we never OOM.
    * Is there somebody interested in actively maintaining this? There's 
already a pretty severe dearth of knowledge about shuffle workings. If we want 
to scale its complexity, we need to scale the number of contributors that 
understand it.
    
    > On Apr 28, 2015, at 10:32 AM, Kay Ousterhout <[email protected]> 
wrote:
    > 
    > It seems like there are two separate issues here:
    > 
    > (1) Should Spark ever have an in-memory shuffle? Personally I think we 
should, partially because it's useful for benchmarking and partially because 
there are some environments (as @mikeringenburg pointed out) where it makes 
more sense to store shuffle data in-memory (for performance reasons or cluster 
provisioning reasons etc.).  However, @rxin, it sounds like you're pretty 
strongly against this for maintainability reasons; if you're going to block all 
attempts at doing this, we should just close SPARK-3376 as "Will not fix".
    > 
    > (2) If yes to the above question, should we add this particular in-memory 
shuffle? To list a few reasons why we might not want this implementation:
    > -In its current form, this implementation doesn't clean up in-memory 
shuffle files any more aggressively than normal shuffle files are cleaned up 
(so the shuffle data won't be deleted until the associated RDD goes out of 
scope).  In-memory shuffle data should really be cleaned up more aggressively, 
because unlike when we store shuffle data on-disk, there's a high cost of 
keeping the data around.
    > -In addition to doing better cleanup of shuffle data, we likely would 
want to store shuffle data as a separate storage level (or with some kind of 
tag) so we can more cleanly fail when shuffle data becomes too large (i.e., 
explicitly fail with an "out of memory for shuffle" kind of exception, rather 
than a generic OOM).
    > Some of these issues are small and could just be fixed as part of this 
PR, while others are more substantial. @sryza and @pwendell, it would help if 
you two could describe what you'd like to see in an ideal version of this 
feature, to understand whether they're things that can just be fixed as part of 
this PR.
    >
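
To make the two concerns under (2) concrete, here is a rough sketch of a shuffle store with its own byte budget. It is a toy Python model, not Spark code: `InMemoryShuffleStore`, `ShuffleMemoryError`, `max_bytes`, and `remove_shuffle` are all hypothetical names chosen for illustration. It shows only the two behaviours discussed above: failing with an explicit "out of memory for shuffle" error instead of a generic OOM, and freeing a shuffle's blocks eagerly rather than waiting for the RDD to go out of scope.

```python
# Hypothetical sketch: none of these names are Spark APIs.
class ShuffleMemoryError(MemoryError):
    """Raised when shuffle blocks would exceed their dedicated budget."""


class InMemoryShuffleStore:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # assumed budget reserved for shuffle data
        self.used_bytes = 0
        self._blocks = {}  # (shuffle_id, map_id, reduce_id) -> bytes

    def put(self, shuffle_id, map_id, reduce_id, data):
        key = (shuffle_id, map_id, reduce_id)
        # Overwriting a block only costs the size difference.
        delta = len(data) - len(self._blocks.get(key, b""))
        if self.used_bytes + delta > self.max_bytes:
            # Explicit, shuffle-specific failure instead of a generic OOM.
            raise ShuffleMemoryError(
                "out of memory for shuffle: need %d bytes, %d of %d in use"
                % (len(data), self.used_bytes, self.max_bytes))
        self._blocks[key] = data
        self.used_bytes += delta

    def get(self, shuffle_id, map_id, reduce_id):
        return self._blocks[(shuffle_id, map_id, reduce_id)]

    def remove_shuffle(self, shuffle_id):
        """Eagerly drop every block of one shuffle; returns the block count."""
        doomed = [k for k in self._blocks if k[0] == shuffle_id]
        for key in doomed:
            self.used_bytes -= len(self._blocks.pop(key))
        return len(doomed)
```

A caller would invoke `remove_shuffle` as soon as the reduce side has consumed a shuffle's output. A spill path, as suggested above, would replace the `raise` with a write to disk; that is left out of this sketch.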



