[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

kayousterhout Tue, 28 Apr 2015 10:32:46 -0700

Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97144931
  
    It seems like there are two separate issues here:
    
    (1) Should Spark ever have an in-memory shuffle?  Personally I think we 
should, partially because it's useful for benchmarking and partially because 
there are some environments (as @mikeringenburg  pointed out) where it makes 
more sense to store shuffle data in-memory (for performance reasons or cluster 
provisioning reasons etc.).  However, @rxin, it sounds like you're pretty 
strongly against this for maintainability reasons; if you're going to block all 
attempts at doing this, we should just close SPARK-3376 as "Will not fix".
    
    (2) If yes to the above question, should we add *this particular* in-memory 
shuffle?  To list a few reasons why we might not want this implementation:
    -In its current form, this implementation doesn't clean up in-memory 
shuffle files any more aggressively than normal shuffle files are cleaned up 
(so the shuffle data won't be deleted until the associated RDD goes out of 
scope).  In-memory shuffle data should really be cleaned up more aggressively, 
because unlike when we store shuffle data on-disk, there's a high cost of 
keeping the data around.
    -In addition to doing better cleanup of shuffle data, we likely would want 
to store shuffle data as a separate storage level (or with some kind of tag) so 
we can more cleanly fail when shuffle data becomes too large (i.e., explicitly 
fail with an "out of memory for shuffle" kind of exception, rather than a 
generic OOM).
    Some of these issues are small and could be fixed as part of this PR, 
while others are more substantial.  @sryza and @pwendell, it would help if you 
two could describe what you'd like to see in an ideal version of this feature, 
so we can tell whether the gaps are things that can be fixed as part of this PR.
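
As a rough illustration of the two concerns raised in the comment (eager cleanup of in-memory shuffle data, and a dedicated failure when shuffle data outgrows its budget), here is a minimal, language-neutral sketch in Python. It is not Spark code and does not reflect the implementation in this PR: `InMemoryShuffleStore`, `ShuffleMemoryExceededError`, `max_bytes`, and `remove_shuffle` are all hypothetical names chosen for the example.

```python
class ShuffleMemoryExceededError(MemoryError):
    """Hypothetical error: shuffle data outgrew its own budget (not a generic OOM)."""


class InMemoryShuffleStore:
    """Toy store that accounts for shuffle blocks separately from other data."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # assumed per-store budget for shuffle data
        self.used_bytes = 0
        self._blocks = {}           # (shuffle_id, map_id) -> bytes

    def put(self, shuffle_id, map_id, data):
        key = (shuffle_id, map_id)
        # Overwriting a block releases the bytes held by the old copy.
        previous = len(self._blocks.get(key, b""))
        needed = self.used_bytes - previous + len(data)
        if needed > self.max_bytes:
            # Fail explicitly, naming shuffle as the cause.
            raise ShuffleMemoryExceededError(
                f"out of memory for shuffle: need {needed} bytes, "
                f"budget is {self.max_bytes}"
            )
        self._blocks[key] = data
        self.used_bytes = needed

    def get(self, shuffle_id, map_id):
        return self._blocks[(shuffle_id, map_id)]

    def remove_shuffle(self, shuffle_id):
        """Eagerly drop all blocks of one shuffle and return the bytes freed."""
        keys = [k for k in self._blocks if k[0] == shuffle_id]
        freed = sum(len(self._blocks.pop(k)) for k in keys)
        self.used_bytes -= freed
        return freed
```

In this sketch the caller would invoke `remove_shuffle` as soon as the consumers of a shuffle have finished, rather than waiting for the owning dataset to go out of scope. When that is actually safe in Spark (for example, with respect to stage retries) is exactly the kind of question the discussion leaves open.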



