Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21977
@squito, this is much clearer for our user base. Right now, they can
control the YARN container allocation to make room for python by increasing the
overhead, but that does nothing to actually limit python to some defined space.
We've found that python actually needs a lot less memory than it uses, because
it doesn't know when to GC. If we only had overhead, then we wouldn't know what
to limit python to.
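To make that concrete, here's roughly what the two approaches look like
(a sketch only: the values are made up, and I'm assuming the
`spark.executor.pyspark.memory` name this PR proposes alongside the existing
`spark.executor.memoryOverhead` setting):

```python
from pyspark.sql import SparkSession

# Today: make room for python by growing the generic overhead. Nothing actually
# caps what the python workers use; overhead also covers JVM off-heap, etc.
overhead_only = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "4g")
)

# With this PR: give python its own, enforced allocation.
explicit_python = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.pyspark.memory", "2g")
)
```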
If we made python memory a subset of overhead, then we would see a lot more
people misconfiguring jobs that don't use python when they copy another job's
settings. This way we can avoid requesting this memory if the job isn't
PySpark. I also think it is clearer to allocate memory to the JVM, python,
and overhead separately. That way executor memory and python executor memory
behave the same way, and you don't have to remember which one also requires you
to bump up overhead.
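The back-of-the-envelope container sizing, as I understand the proposal (my
sketch, not the PR's exact code), would be: python memory is only added to the
request when it's set, so non-PySpark jobs don't pay for it.

```python
def container_request_mb(executor_mb, overhead_mb, pyspark_mb=None):
    """YARN container size: JVM heap + overhead, plus python only if configured."""
    return executor_mb + overhead_mb + (pyspark_mb or 0)

container_request_mb(8192, 819)                    # JVM-only job: 9011 MB
container_request_mb(8192, 819, pyspark_mb=2048)   # PySpark job: 11059 MB
```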
For supported platforms, I think it's only Windows that doesn't
support the limits. Even on systems that don't support the limit, explicitly
allocating memory to python is better, because users see something specific to
increase when memory runs out, instead of needing to know that they should
increase some generic overhead setting.
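The Windows gap is basically the standard `resource` module, which is
Unix-only. Roughly how a per-worker limit can be enforced (my sketch of the
idea, not the PR's exact code):

```python
try:
    import resource

    def set_python_memory_limit(limit_bytes):
        # Cap the worker's address space; allocations past the limit fail
        # (typically a MemoryError) instead of blowing past the container.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
except ImportError:
    # Windows: no resource module, so the setting can only inform container
    # sizing; it can't be enforced in the worker.
    def set_python_memory_limit(limit_bytes):
        pass
```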