[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

Imran Rashid (JIRA) Wed, 23 Jan 2019 14:59:19 -0800


    [ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750504#comment-16750504 ]


Imran Rashid commented on SPARK-26679:
--------------------------------------

I agree the old name "spark.python.worker.memory" is very confusing, but I 
also don't see how you'd combine the two settings.  There are two extreme 
cases: (1) an app that does a ton of work in Python and allocates a lot of 
memory from user code, but never touches the sort machinery, and (2) an app 
that uses the sort machinery within Python but allocates very little memory 
from user code.  I don't think there are enough hooks into Python's memory 
allocation to spill automatically when user code suddenly tries to allocate 
more memory.

I agree it might make more sense to do something like spark.memory.fraction.  
I'm not sure whether we should reuse that config to decide what fraction of 
the pyspark memory goes to the pyspark shuffle machinery, or whether there 
should be a new config, spark.memory.pyspark.fraction.  (I can't think of a 
use case for keeping those separate.)
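
To make the fraction idea concrete, here is a minimal sketch of how a spill 
threshold could be derived from the total pyspark memory.  It is not Spark's 
implementation: parse_memory and spill_threshold are hypothetical helpers, 
and the 0.6 default simply mirrors the default of spark.memory.fraction as 
an assumption.

```python
import re

# Binary multipliers for the size suffixes handled by this sketch.
_UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30, "t": 1 << 40}


def parse_memory(value: str) -> int:
    """Convert a size string such as '512m' or '2g' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def spill_threshold(pyspark_memory: str, fraction: float = 0.6) -> int:
    """Bytes the shuffle machinery may hold before spilling to disk.

    pyspark_memory stands in for spark.executor.pyspark.memory; fraction
    stands in for the proposed spark.memory.pyspark.fraction (or a reused
    spark.memory.fraction). The 0.6 default is an assumption.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return int(parse_memory(pyspark_memory) * fraction)
```

With a 2g worker limit and a fraction of 0.5, the sort machinery would start 
spilling at 1 GiB, leaving the other half for user code.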

> Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
> -----------------------------------------------------------------------
>
>                 Key: SPARK-26679
>                 URL: https://issues.apache.org/jira/browse/SPARK-26679
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Ryan Blue
>            Priority: Major
>
> In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory 
> space of a Python worker. There is another RDD setting, 
> spark.python.worker.memory, that controls when Spark decides to spill data 
> to disk. The two sound similar but are currently unrelated to one another.
> PySpark should probably use spark.executor.pyspark.memory to limit or default 
> the setting of spark.python.worker.memory, because the latter property 
> controls spilling and should be lower than the total memory limit. Renaming 
> spark.python.worker.memory would also help clarity, because it sounds like 
> it should control the limit but is more like the JVM setting 
> spark.memory.fraction.
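
The "limit or default" relationship described above could look roughly like 
the following sketch.  It is only an illustration, not code from PySpark: 
effective_spill_limit_mib is a hypothetical function, sizes are plain MiB 
integers, and capping at the total (rather than at some fraction of it) is 
an assumption.

```python
from typing import Optional

# Default of spark.python.worker.memory (512m), expressed in MiB.
DEFAULT_WORKER_MEMORY_MIB = 512


def effective_spill_limit_mib(
    pyspark_memory_mib: Optional[int],
    worker_memory_mib: Optional[int],
) -> int:
    """Pick the spill threshold so it never exceeds the worker's total limit.

    pyspark_memory_mib stands in for spark.executor.pyspark.memory (None if
    unset); worker_memory_mib stands in for spark.python.worker.memory (None
    if the user did not set it).
    """
    requested = (
        DEFAULT_WORKER_MEMORY_MIB if worker_memory_mib is None else worker_memory_mib
    )
    if pyspark_memory_mib is None:
        # No total limit configured: keep today's behavior.
        return requested
    # Cap the spill threshold at the total memory limit.
    return min(requested, pyspark_memory_mib)
```

A stricter variant would cap at a fraction of the total so that some memory 
always remains for user code.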



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
