Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/5608#issuecomment-94771988
Are you talking about a "shared" object, like an array containing many
references to the same object?
There does seem to be a difference in behavior between the sampled and
non-sampled cases. In the non-sampled case, the estimator accounts for shared
data structures and counts them once. In the sampled case, it does not, and
counts them each time they are reached.
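To make that concrete, here is a tiny standalone sketch. This is not
Spark's `SizeEstimator`; the per-object and per-reference costs are invented,
and it only mimics the two counting strategies described above, for an array
of 1000 references to one shared object:

```scala
import java.util.IdentityHashMap

// Toy cost model: every distinct object costs 100 "units", each reference 8.
object SharedCountingSketch {
  val ObjectCost = 100L
  val RefCost = 8L

  // Non-sampled style: walk every element but remember what has been seen,
  // so a shared object contributes its cost only once.
  def estimateWithVisited(array: Array[AnyRef]): Long = {
    val visited = new IdentityHashMap[AnyRef, AnyRef]()
    var size = RefCost * array.length
    for (elem <- array if elem != null && !visited.containsKey(elem)) {
      visited.put(elem, elem)
      size += ObjectCost
    }
    size
  }

  // Sampled style: cost a few elements independently (no shared visited set)
  // and scale up, so a shared object is charged once per sampled reference.
  def estimateBySampling(array: Array[AnyRef], sampleSize: Int): Long = {
    val rand = new scala.util.Random(42)
    var sampled = 0L
    for (_ <- 0 until sampleSize) {
      if (array(rand.nextInt(array.length)) != null) sampled += ObjectCost
    }
    RefCost * array.length + sampled * array.length / sampleSize
  }

  def main(args: Array[String]): Unit = {
    val shared = new Object
    val array: Array[AnyRef] = Array.fill(1000)(shared)
    println(estimateWithVisited(array))      // 8100: one object + 1000 refs
    println(estimateBySampling(array, 100))  // 108000: object charged per ref
  }
}
```

With the visited set, the shared object is charged once; with scaled-up
sampling, every sampled reference is charged and then multiplied out, so the
estimate can be wildly larger for heavily shared data.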
It seems like they should behave the same way. The former better reflects
the size in memory; the latter might better reflect the size as serialized to
disk. But we can't really know the on-disk size this way, so I suspect it is
supposed to use the same mechanism in both cases. That is, yes, it shouldn't
double-count shared data structures in the sampled case. (CC @mateiz as the
author of that bit, in case he has comments.)
However, that's not what you are changing at all, and I don't understand
the intent of these changes. Why is 200 special, such that it needs to be 400?
What does the second loop accomplish? Why not use `enqueue` like the other
branch does? The rest of this doesn't look like a correct change.
It's not going to change the spilled size of data on disk. I think what it
might do is *not* cause it to spill so early. This is highly dependent on the
nature of the data being serialized -- whether there are shared data structures
and how big the serialized form is. Since the question here is when to spill
under memory pressure, I do think that making the logic similar for the sampled
and non-sampled cases sounds correct.
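To spell out why: the spill decision is driven by a threshold check on the
estimated in-memory size, so an inflated estimate crosses the threshold
sooner. A rough sketch (purely illustrative; the names and numbers are made
up, and this is not Spark's actual spill code):

```scala
// Illustrative only: a collection spills once its *estimated* size passes
// a memory threshold, so over-estimation triggers earlier/more spills.
class SpillSketch(memoryThresholdBytes: Long) {

  // Called periodically with the collection's estimated in-memory size.
  def maybeSpill(estimatedSize: Long): Boolean = {
    if (estimatedSize > memoryThresholdBytes) {
      // here the real code would write the collection out and free memory
      true
    } else {
      false
    }
  }
}

object SpillSketchDemo extends App {
  val sketch = new SpillSketch(memoryThresholdBytes = 64L * 1024 * 1024)
  // Double-counting shared structures can push the estimate past the
  // threshold even though the real footprint is much smaller.
  println(sketch.maybeSpill(estimatedSize = 80L * 1024 * 1024))  // true
  println(sketch.maybeSpill(estimatedSize = 40L * 1024 * 1024))  // false
}
```

The bytes actually written in a spill come from the serializer, not from the
estimate, which is why a fix here would change *when* spills happen rather
than how much ends up on disk.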
CC @sryza and @rxin as this might be a good catch regarding shuffle spills
in certain cases, but I'm not entirely sure.