GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95104653
  
    I understand this a bit better. I see that `estimate` already accounts for shared objects, since it calls `enqueue`, so my comment above is wrong.
    
    However, wouldn't the shared data structures be counted when estimating the first non-null object? Isn't it sufficient to simply discard that first sample? I understand this only catches the case where every non-null object references the same shared state, but a sampling approach can't reliably detect anything subtler anyway. The two-loop approach could be written more simply, but even then the extra complexity doesn't seem to buy much.
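    To make that concrete, here is a minimal sketch of the discard-the-first-sample idea. This is not Spark's actual code: `extrapolate`, the object name, and the sample sizes are all hypothetical, and it assumes the estimator's visited set persists across the sampled elements, so the shared structure's size lands entirely in the first non-null sample.

```scala
object DiscardFirstSample {
  /** Extrapolates a collection's size from per-element sampled sizes,
    * counting the first sample (which carries the shared state) only once. */
  def extrapolate(sampledSizes: Seq[Long], numElements: Long): Long =
    sampledSizes match {
      case first +: rest if rest.nonEmpty =>
        // Charge the shared structure once; extrapolate from the rest.
        first + (rest.sum / rest.size) * (numElements - 1)
      case single +: _ =>
        single * numElements // only one sample, so nothing to discard
      case _ =>
        0L
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: the first sample walked a 10 KB shared structure
    // plus its own ~100 bytes; later samples saw only their own ~100 bytes
    // because the shared structure had already been visited.
    val samples = Seq(10100L, 100L, 100L, 100L)
    println(extrapolate(samples, numElements = 1000)) // 10100 + 100 * 999 = 110000
  }
}
```

    Naively averaging all four samples would charge the 10 KB shared structure to each of the 1000 elements (~2.6 MB); counting it once gives ~110 KB.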
    
    If the purpose is to spill when there is memory pressure, then why ignore a shared data structure? It still takes up memory. It likely won't be written to disk, but that isn't the question. Or is the purpose to spill when the spilled size is likely to be large enough, with memory usage just a proxy for that?
    
    I'd like to see if merely discarding the first non-null sample fixes most 
of this more simply.

