Ryan Blue created SPARK-28843:
---------------------------------

             Summary: Set OMP_NUM_THREADS to executor cores to reduce Python 
memory consumption
                 Key: SPARK-28843
                 URL: https://issues.apache.org/jira/browse/SPARK-28843
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.4.3, 2.3.3, 3.0.0
            Reporter: Ryan Blue


While testing hardware with more cores, we found that the memory required by 
PySpark applications increased, and we tracked the problem to importing numpy. 
The numpy issue is [https://github.com/numpy/numpy/issues/10455]

NumPy uses OpenMP, which starts a thread pool with one thread per core on the 
machine (and does not respect cgroup limits). When we set OMP_NUM_THREADS 
lower, we see a reduction in memory consumption.
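
A minimal sketch of the effect on a single machine (the value 1 is just for 
illustration; the variable must be exported before numpy is first imported, 
since the OpenMP thread pool is created at import time):

{code:python}
import os

# OMP_NUM_THREADS only takes effect if it is set before numpy (and the
# OpenMP runtime it pulls in) is first imported.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # the OpenMP thread pool is created during this import

print(np.__version__)
{code}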

OMP_NUM_THREADS should be set to the number of cores allocated to the 
executor, not the number of cores available on the machine.
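
Until Spark sets this automatically, one workaround is to pass the variable 
through to executors with the standard spark.executorEnv.* configuration; a 
sketch, assuming hypothetical 4-core executors (the value should match 
spark.executor.cores):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("omp-num-threads-workaround")
    .config("spark.executor.cores", "4")
    # spark.executorEnv.* sets environment variables for executor processes,
    # which PySpark workers inherit before they import numpy.
    .config("spark.executorEnv.OMP_NUM_THREADS", "4")
    .getOrCreate()
)
{code}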


