rdblue commented on issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory limit. URL: https://github.com/apache/spark/pull/21977#issuecomment-456139148

@HyukjinKwon, I like the idea of using `spark.executor.pyspark.memory` to control or bound the other setting, but I don't think it can replace `spark.python.worker.memory`. The problem is that the first setting controls the total size of the address space, while the second is a threshold that causes data to be spilled. If the spill threshold were the total size limit, Python would run out of memory before it started spilling data.

I think it makes sense to have both settings. The JVM has executor memory and Spark memory (controlled by `spark.memory.fraction`), so these settings create something similar: total Python memory and the threshold above which PySpark spills to disk. That means the spill setting should have a better name and should be limited by the total memory. Maybe ensure its max is `spark.executor.pyspark.memory` minus 300MB, or something similarly reasonable?

I think we should avoid introducing a property like `spark.memory.fraction` for Python. That is confusing for users and often ignored, leading to wasted memory. Setting explicit sizes is a better approach.
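For illustration, a minimal sketch of how the two settings could be configured together under this proposal; the specific values and the 300MB headroom constant are assumptions for the example, not part of the PR:

```python
from pyspark import SparkConf, SparkContext

# Sketch: keep the spill threshold safely below the total Python memory cap.
TOTAL_PYSPARK_MEM_MB = 2048   # spark.executor.pyspark.memory: address-space cap (illustrative)
HEADROOM_MB = 300             # suggested headroom below the cap (assumption)
spill_threshold_mb = min(1024, TOTAL_PYSPARK_MEM_MB - HEADROOM_MB)

conf = (
    SparkConf()
    .setAppName("pyspark-memory-example")
    # New setting from this PR: hard limit on each Python worker's memory.
    .set("spark.executor.pyspark.memory", "%dm" % TOTAL_PYSPARK_MEM_MB)
    # Existing setting: threshold above which PySpark aggregation spills to disk.
    .set("spark.python.worker.memory", "%dm" % spill_threshold_mb)
)

sc = SparkContext(conf=conf)
```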
