[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...

sryza Wed, 22 Apr 2015 10:21:45 -0700

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4450#issuecomment-95272806
  
    Thanks for the review Patrick.
    
    Regarding Kryo:  I would be really really surprised if Kryo were to change 
its serialization format in such a drastic way without adding a backwards 
compatibility option.  Of course, not impossible. So yeah, the assumption here 
is that they won't do that.  If they did, and we upgraded our dependency, we'd 
need to remove or rethink this feature.
    
    The main reason for the ChainedBuffer is a performance optimization.  
However, it's not to minimize copies, it's to use memory efficiently and 
minimize spills.  If we use an array that grows by doubling, we can end up 
momentarily using 3x the memory that we actually need to store the bytes.  I.e. 
if we have 512 MB of data and write another MB, we then allocate a 1 GB array.  
If that byte was our last byte, we've ended up with a momentary allocation of 
1.5 GB to hold 513 MB of data.  Its other advantage is that we can support a 
buffer greater than 2GB in size.  BTW, the current default value of 8 KB is 
probably way too low.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...

Reply via email to