Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74946838
  
    @mccheah @mingyukim yeah, there isn't an OOM-proof solution at all because 
these are all heuristics. Even checking every element is not OOM-proof, since 
memory estimation is itself a heuristic that involves sampling. My only concern 
with exposing knobs here is that users will expect us to support them going 
forward, even though we may want to refactor this in the future in a way where 
those knobs no longer make sense. It's reasonable that users would consider it 
a regression if their tuning of those knobs stopped working.
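    To illustrate why sampling-based size estimation is only a heuristic, here 
is a minimal sketch (Python, with a hypothetical `estimate_size` helper; 
Spark's actual `SizeEstimator` is Scala and considerably more sophisticated). 
A small sample can miss rare large elements and badly under-estimate the total:

    ```python
    import random
    import sys

    def estimate_size(items, sample_size=10):
        """Estimate total in-memory size by sizing a random sample and
        extrapolating -- a simplified stand-in for what a sampling-based
        estimator does."""
        sample = random.sample(items, min(sample_size, len(items)))
        avg = sum(sys.getsizeof(x) for x in sample) / len(sample)
        return avg * len(items)

    # Mostly tiny strings plus a few huge outliers: a 10-element sample
    # will usually miss all the outliers, so the extrapolated estimate
    # is far below the real footprint -- which is how an OOM can slip
    # past a sampling-based check.
    items = ["x"] * 9990 + ["y" * 1_000_000] * 10
    actual = sum(sys.getsizeof(x) for x in items)
    estimate = estimate_size(items, sample_size=10)
    ```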
    
    So if possible, it would be good to adjust our heuristics to cover a wider 
range of use cases, and then expose knobs if we keep hearing about more issues. 
Since they are heuristics, they can't meet every possible use case, but in this 
case I was wondering whether we could make a strict improvement to the 
heuristics. @andrewor14 can you comment on whether this is indeed a strict 
improvement?
    
    One of the main benefits of the new DataFrames API is that we will be able 
to have precise control over memory usage in a way that avoids OOMs entirely. 
But for the current Spark API we rely on this more ad-hoc memory estimation 
along with some heuristics.
    
    I'm not 100% against exposing knobs either, but I'd be interested to see 
whether some simple improvements fix your use case.


