rdblue commented on issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory limit.
URL: https://github.com/apache/spark/pull/21977#issuecomment-456139148
 
 
   @HyukjinKwon, I like the idea of using `spark.executor.pyspark.memory` to 
control or bound the other setting, but I don't think it can replace 
`spark.python.worker.memory`.
   
   The problem is that the first setting controls the total size of the address 
space, while the second is a threshold at which data is spilled. If the spill 
threshold were set to the total size limit, Python would run out of memory 
before it ever started spilling data.
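   
   To make the distinction concrete, here is a rough sketch (not the worker code in 
this PR; the `resource.setrlimit` call and the sizes are just illustrative) of a hard 
address-space cap versus a soft spill threshold:

```python
import resource

# Illustrative sizes only.
total_limit = 2 * 1024 ** 3         # spark.executor.pyspark.memory: hard cap
spill_threshold = 1536 * 1024 ** 2  # spark.python.worker.memory: soft threshold

# A hard cap on the worker's address space: allocations beyond this raise
# MemoryError instead of being spilled.
resource.setrlimit(resource.RLIMIT_AS, (total_limit, total_limit))

def add_record(used_bytes, spill_to_disk):
    # A soft threshold: once estimated usage crosses it, spill to disk.
    # If spill_threshold were equal to total_limit, the MemoryError above
    # would fire before this branch was ever reached.
    if used_bytes > spill_threshold:
        spill_to_disk()
```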
   
   I think it makes sense to have both settings. The JVM has executor memory 
and Spark memory (controlled by `spark.memory.fraction`), so these settings 
create something similar for Python: a total Python memory limit and the 
threshold above which PySpark will spill to disk.
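   
   Assuming the new setting keeps the name used in this PR, configuring the two 
together would look something like this (the sizes are just examples):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark-memory-example")
    # Total memory for the executor's Python workers (the new hard limit).
    .config("spark.executor.pyspark.memory", "2g")
    # Existing threshold above which PySpark aggregation spills to disk.
    .config("spark.python.worker.memory", "512m")
    .getOrCreate()
)
```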
   
   I think that means the spill setting should have a better name and should be 
capped by the total memory. Maybe ensure its maximum is 
`spark.executor.pyspark.memory` - 300MB, or something similarly reasonable?
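   
   Something along these lines is what I have in mind (purely illustrative, not 
code from this PR; the helper name and the 300MB headroom are placeholders):

```python
def effective_spill_threshold_mb(total_pyspark_memory_mb, requested_spill_mb,
                                 headroom_mb=300):
    # Cap the spill threshold so it always leaves headroom below the total
    # Python memory limit; fall back to the requested value if no total
    # limit is configured.
    if total_pyspark_memory_mb is None:
        return requested_spill_mb
    return min(requested_spill_mb, max(total_pyspark_memory_mb - headroom_mb, 0))
```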
   
   I think we should avoid introducing a property like `spark.memory.fraction` 
for Python. That is confusing for users and often ignored, leading to wasted 
memory. Setting explicit sizes is a better approach.
