Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74987043
  
    @mingyukim @mccheah
    Just to clarify, the reason we have the every-N-elements check is not to
avoid spilling on every element (which is unlikely anyway), but to avoid
entering the synchronized block for every element. The latter is expensive
when multiple threads are trying to unroll blocks in memory at the same time.
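
    To make the tradeoff concrete, here is a rough, self-contained sketch of
that idea (this is not the actual `MemoryStore` code; `FakeMemoryManager`,
`tryToAcquire`, and the constants are just illustrative names): the shared
lock is taken roughly once every `checkPeriod` elements instead of once per
element.

```scala
object PeriodicCheckSketch {

  // Stand-in for the shared memory manager; the synchronized method is the
  // lock that every unrolling thread would otherwise hit on every element.
  class FakeMemoryManager {
    private var available: Long = 1L << 20                // 1 MiB budget
    def tryToAcquire(bytes: Long): Boolean = synchronized {
      if (bytes <= available) { available -= bytes; true } else false
    }
  }

  def unroll(values: Iterator[Array[Byte]], mm: FakeMemoryManager): Boolean = {
    val checkPeriod = 16                 // the "every N elements" in question
    val buffer = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
    var reserved = 0L                    // bytes we have already been granted
    var granted = true
    var count = 0L
    while (granted && values.hasNext) {
      buffer += values.next()
      count += 1
      if (count % checkPeriod == 0) {    // amortize the synchronized call
        val currentSize = buffer.iterator.map(_.length.toLong).sum
        if (currentSize > reserved) {
          // Ask only for the bytes beyond what we already reserved.
          granted = mm.tryToAcquire(currentSize - reserved)
          if (granted) reserved = currentSize
        }
      }
    }
    granted                              // false => caller should spill / abort
  }
}
```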
    
    I think the original motivation for `trackMemoryThreshold` is to optimize
the case of very small partitions. In such cases we currently don't even try
to enter the synchronized block, since its cost could be a non-trivial
fraction of an otherwise short aggregation. However, this is all conjecture,
and I don't believe this optimization is crucial for such workloads, so I
don't see a strong reason to keep the threshold if removing it means we can
handle large items.
    
    However, simply removing the threshold won't solve the general case. There
could be a burst of large items in the middle of the stream, in which case we
will still buffer too many of them (up to 32) in memory before the next check.
One potential fix is to check every N bytes instead of every N elements. This
is cheap because we already have an estimate of the map size before and after
each element is inserted.
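
    Roughly what I have in mind, as a sketch only (the names `estimateSize`,
`tryToAcquire`, and the 4 MiB threshold below are illustrative, not the
actual Spark API): trigger the memory request whenever the estimated size has
grown by more than some number of bytes since the last check, so a burst of
large records can't pile up 32 of them before an element-count check fires.

```scala
import scala.collection.mutable

object ByteBasedCheckSketch {
  def insertAll(
      records: Iterator[(String, Long)],
      map: mutable.HashMap[String, Long],
      estimateSize: () => Long,          // assumed cheap, maintained incrementally
      tryToAcquire: Long => Boolean      // synchronized in the real code
    ): Boolean = {
    val checkBytes = 4L * 1024 * 1024    // hypothetical threshold: 4 MiB of growth
    var lastCheckedSize = estimateSize()
    var granted = true
    while (granted && records.hasNext) {
      val (k, v) = records.next()
      map.put(k, map.getOrElse(k, 0L) + v)
      val size = estimateSize()          // before/after estimate is already available
      if (size - lastCheckedSize > checkBytes) {
        // Ask only for the growth since the last successful check.
        granted = tryToAcquire(size - lastCheckedSize)
        if (granted) lastCheckedSize = size
      }
    }
    granted                              // false => time to spill
  }
}
```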

