Github user squito commented on the issue:
https://github.com/apache/spark/pull/21977
> We've found that python requires a lot less memory than it actually uses
> because it doesn't know when to GC
yes, totally agree, sorry I wasn't clear in my initial comment -- overall I
think this is a great idea!
> If we made python memory a subset of overhead, then we would see a lot
> more people misconfiguring jobs that don't use python when they copy another
> job's settings. This way we can avoid requesting this memory if the job isn't
> PySpark. I also think it is more clear to allocate memory to the JVM, python,
> and overhead separately. That way executor memory and python executor memory
> are similar and you don't have to remember which one requires you to bump up
> overhead as well.
While I agree with this to some extent, when users copy configs they already
get memory horribly wrong; they really just need to understand what their job
is doing. My concern is that the meaning of the overhead parameter becomes
pretty confusing. It's (offheap JVM) + (any external process), unless you have
this new python conf set, in which case it's (offheap JVM) + (any external
process other than python), though YARN still monitors based on everything
combined. Maybe that's unavoidable.
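To make that accounting concrete, here's a rough sketch of how the container
request works out under the two schemes -- this is illustrative python only,
not the actual YARN allocation code, and the numbers are made up:

```python
def container_request_mb(executor_mem_mb, overhead_mb, pyspark_mem_mb=0,
                         separate_python_conf=False):
    """Illustrative only: roughly how the YARN container size is derived."""
    if separate_python_conf:
        # overhead = offheap JVM + external processes *other than* python,
        # and the python share is requested explicitly on top
        return executor_mem_mb + overhead_mb + pyspark_mem_mb
    else:
        # overhead = offheap JVM + all external processes, python included
        return executor_mem_mb + overhead_mb

# 8 GiB heap, 1 GiB overhead, 2 GiB for python workers
print(container_request_mb(8192, 1024, 2048, separate_python_conf=True))  # 11264
print(container_request_mb(8192, 1024))  # 9216, python must fit inside the 1 GiB
```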
So if users don't set this conf, the behavior is the same as before, right?
And when they want to take advantage of it, they just change their confs to
move memory from the overhead to the new conf? I think I'm OK with it then; I
thought this was doing something else on the first read.
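Just to spell out what I mean by moving the memory over -- assuming the new
conf ends up as spark.executor.pyspark.memory alongside the existing
spark.executor.memoryOverhead, and with made-up numbers, a user would go from
the first setup to the second:

```python
# before: python workers have to fit inside a padded memoryOverhead
before = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "3g",   # ~1g offheap JVM + ~2g headroom for python
}

# after: same total container size, but the python share is requested explicitly
after = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "1g",   # offheap JVM / other external processes only
    "spark.executor.pyspark.memory": "2g",   # python workers' share
}
```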
More general brainstorming -- I suppose there is no way to give python a hint
to gc more often? This is sort of like moving from the UnifiedMemoryManager
back to the Static one, as now you put in a hard barrier. Seems worth it
anyway, just thinking about what this means.
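For what it's worth, the kind of hint I have in mind is just the stock gc
module knobs inside the worker function -- purely a sketch (process_partition
is hypothetical, not anything in this PR):

```python
import gc

# run the cyclic collector more eagerly than CPython's defaults (~(700, 10, 10))
gc.set_threshold(100, 5, 5)

def process_partition(rows):
    """mapPartitions-style function that nudges python to collect periodically."""
    for i, row in enumerate(rows):
        yield row  # stand-in for the real per-row work
        if i % 100000 == 0:
            # force a full collection every so often; note this only frees
            # cyclic garbage, and the process may not hand memory back to the OS
            gc.collect()
```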