Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/5608#issuecomment-94771988
Are you talking about a "shared" object, like an array containing many
references to the same object?
There does seem to be a difference in behavior between the sampled and
non-sampled cases. In the non-sampled case, the estimator accounts for shared
data structures and counts them once. In the sampled case, it does not, and
counts them each time they are reached.
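To make that concrete, here is a tiny standalone sketch. This is not
Spark's `SizeEstimator`; the per-object and per-reference costs are invented,
and it only mimics the two counting strategies described above, for an array
of 1000 references to one shared object:

```scala
import java.util.IdentityHashMap

// Toy cost model: every distinct object costs 100 "units", each reference 8.
object SharedCountingSketch {
  val ObjectCost = 100L
  val RefCost = 8L

  // Non-sampled style: walk every element but remember what has been seen,
  // so a shared object contributes its cost only once.
  def estimateWithVisited(array: Array[AnyRef]): Long = {
    val visited = new IdentityHashMap[AnyRef, AnyRef]()
    var size = RefCost * array.length
    for (elem <- array if elem != null && !visited.containsKey(elem)) {
      visited.put(elem, elem)
      size += ObjectCost
    }
    size
  }

  // Sampled style: cost a few elements independently (no shared visited set)
  // and scale up, so a shared object is charged once per sampled reference.
  def estimateBySampling(array: Array[AnyRef], sampleSize: Int): Long = {
    val rand = new scala.util.Random(42)
    var sampled = 0L
    for (_ <- 0 until sampleSize) {
      if (array(rand.nextInt(array.length)) != null) sampled += ObjectCost
    }
    RefCost * array.length + sampled * array.length / sampleSize
  }

  def main(args: Array[String]): Unit = {
    val shared = new Object
    val array: Array[AnyRef] = Array.fill(1000)(shared)
    println(estimateWithVisited(array))      // 8100: one object + 1000 refs
    println(estimateBySampling(array, 100))  // 108000: object charged per ref
  }
}
```

With the visited set, the shared object is charged once; with scaled-up
sampling, every sampled reference is charged and then multiplied out, so the
estimate can be wildly larger for heavily shared data.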
It seems like they should behave the same way. The former better reflects
the size in memory; the latter might better reflect the size as serialized to
disk. But we can't really know the on-disk size this way, so I suspect it is
supposed to use the same mechanism in both cases. That is, yes, it shouldn't
double-count shared data structures in the sampled case. (CC @mateiz as the
author of that bit, in case he has comments.)
However, that's not what you are changing at all, and I don't understand
the intent of these changes. Why is 200 special, such that it needs to be 400?
What does the second loop accomplish? Why not use `enqueue` like the other
branch does? The rest of this doesn't look like a correct change.
It's not going to change the spilled size of data on disk. I think what it
might do is *not* cause it to spill so early. This is highly dependent on the
nature of the data being serialized -- whether there are shared data structures
and how big the serialized form is. Since the question here is when to spill
under memory pressure, I do think that making the logic similar for the sampled
and non-sampled cases sounds correct.
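To spell out why: the spill decision is driven by a threshold check on the
estimated in-memory size, so an inflated estimate crosses the threshold
sooner. A rough sketch (purely illustrative; the names and numbers are made
up, and this is not Spark's actual spill code):

```scala
// Illustrative only: a collection spills once its *estimated* size passes
// a memory threshold, so over-estimation triggers earlier/more spills.
class SpillSketch(memoryThresholdBytes: Long) {

  // Called periodically with the collection's estimated in-memory size.
  def maybeSpill(estimatedSize: Long): Boolean = {
    if (estimatedSize > memoryThresholdBytes) {
      // here the real code would write the collection out and free memory
      true
    } else {
      false
    }
  }
}

object SpillSketchDemo extends App {
  val sketch = new SpillSketch(memoryThresholdBytes = 64L * 1024 * 1024)
  // Double-counting shared structures can push the estimate past the
  // threshold even though the real footprint is much smaller.
  println(sketch.maybeSpill(estimatedSize = 80L * 1024 * 1024))  // true
  println(sketch.maybeSpill(estimatedSize = 40L * 1024 * 1024))  // false
}
```

The bytes actually written in a spill come from the serializer, not from the
estimate, which is why a fix here would change *when* spills happen rather
than how much ends up on disk.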
CC @sryza and @rxin as this might be a good catch regarding shuffle spills
in certain cases, but I'm not entirely sure.