srowen commented on a change in pull request #25545: [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python
URL: https://github.com/apache/spark/pull/25545#discussion_r318690096
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
 ##########
 @@ -106,6 +106,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
     val startTime = System.currentTimeMillis
     val env = SparkEnv.get
     val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
+    // if OMP_NUM_THREADS is not explicitly set, override it with the number of cores
+    if (conf.getOption("spark.executorEnv.OMP_NUM_THREADS").isEmpty) {
+      // SPARK-28843: limit the OpenMP thread pool to the number of cores assigned to this executor
+      // this avoids high memory consumption with pandas/numpy because of a large OpenMP thread pool
+      // see https://github.com/numpy/numpy/issues/10455
+      conf.getOption("spark.executor.cores").foreach(envVars.put("OMP_NUM_THREADS", _))
 
 Review comment:
   It remains up to users to control this if desired. The change only prevents the use of a pool that is clearly too big: nothing should cause a job to try to use more cores than the _executor_ is allowed to. This is the most conservative possible change; it just makes a clearly wrong default better in some cases. A lower setting is likely even better in some cases, but we don't force that here, as the right value 'depends' a bit.
   
   See my post above. For example, in PySpark with 4 x 4-core executors on a 16-core machine, you get _256_ threads before this change and 64 after. Even 64 is too high, but it is better. The change doesn't help a 1 x 16-core executor in PySpark.
   
   Are you arguing for the 'more aggressive' change, to force it to 1 if unset in PySpark? That's also coherent. But then I don't see why the less aggressive change is a problem.
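
The thread counts quoted in the comment can be reproduced with a small back-of-the-envelope calculation. This is only a sketch under two assumptions that the comment implies but does not spell out: each executor runs one Python worker process per executor core, and an OpenMP runtime with `OMP_NUM_THREADS` unset defaults to one thread per machine core. The function name `openmp_threads` and its parameters are hypothetical and exist only for illustration; they are not part of Spark.

```python
def openmp_threads(machine_cores, executors, cores_per_executor, omp_num_threads=None):
    """Estimate the total OpenMP threads on one machine.

    Assumptions (illustrative, not taken from the patch):
    - one Python worker process per executor core
    - with OMP_NUM_THREADS unset, each worker's OpenMP pool defaults
      to the number of cores on the machine
    """
    python_workers = executors * cores_per_executor
    if omp_num_threads is None:
        threads_per_worker = machine_cores
    else:
        threads_per_worker = omp_num_threads
    return python_workers * threads_per_worker


# 4 x 4-core executors on a 16-core machine
before = openmp_threads(16, executors=4, cores_per_executor=4)
after = openmp_threads(16, executors=4, cores_per_executor=4, omp_num_threads=4)

# 1 x 16-core executor: capping at executor cores changes nothing
single_before = openmp_threads(16, executors=1, cores_per_executor=16)
single_after = openmp_threads(16, executors=1, cores_per_executor=16, omp_num_threads=16)

# the 'more aggressive' option: force OMP_NUM_THREADS to 1
forced_one = openmp_threads(16, executors=4, cores_per_executor=4, omp_num_threads=1)

print(before, after, single_before, single_after, forced_one)
```

Under these assumptions the 4 x 4 case drops from 256 to 64 threads, the 1 x 16 case stays at 256 either way, and forcing the value to 1 would give 16 threads, one per Python worker.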

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
