Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/4420#issuecomment-74946838
@mccheah @mingyukim yeah, there isn't an OOM-proof solution at all because
these are all heuristics. Even checking every element is not OOM-proof, since
memory estimation is itself a heuristic that involves sampling. My only concern
with exposing knobs here is that users will expect us to support them going
forward, even though we may want to refactor this in the future in a way where
those knobs no longer make sense. It's reasonable that users would consider it a
regression if their tuning of those knobs stopped working.
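To make the sampling point concrete, here is a minimal, hypothetical sketch (not Spark's actual `SizeEstimator`) of why sample-based size estimation can never be OOM-proof: it measures only a few elements and extrapolates the average, so an unrepresentative sample produces an estimate that is off by an arbitrary factor. The class and method names below are illustrative only.

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sampling-based size estimator. Real JVM size estimation would
// walk object graphs; here `measure` is a stand-in for that per-element cost.
public class SampleEstimate {
    /** Estimate total bytes by measuring only the first sampleSize elements
     *  and extrapolating their average cost to the whole collection. */
    static <T> long estimateBytes(List<T> items, ToLongFunction<T> measure, int sampleSize) {
        if (items.isEmpty()) return 0L;
        int n = Math.min(sampleSize, items.size());
        long sampledBytes = 0L;
        for (int i = 0; i < n; i++) {
            sampledBytes += measure.applyAsLong(items.get(i));
        }
        double avgPerElement = (double) sampledBytes / n;
        // If the unsampled tail holds much larger elements, this extrapolation
        // undercounts, and a size-based spill threshold can still OOM.
        return (long) (avgPerElement * items.size());
    }
}
```

If the first 32 elements are small strings but the rest are large blobs, the extrapolated total can be far below the true footprint; no knob on the sample size fully closes that gap without measuring everything.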
So if possible, it would be good to adjust our heuristics to meet a wider
range of use cases, and then, if we keep hearing about more issues, we can
expose knobs. We can't have them meet every possible use case, since they are
heuristics, but in this case I was wondering if we could make a strict
improvement to the heuristics. @andrewor14 can you comment on whether this is
indeed a strict improvement?
One of the main benefits of the new DataFrames API is that we will be able
to have precise control over memory usage in a way that can avoid OOMs
entirely. But for the current Spark API we are using this more ad hoc memory
estimation along with some heuristics.
I'm not 100% against exposing knobs either, but I'd be interested to see
whether some simple improvements fix your use case.