Antony created SPARK-53364:
------------------------------

Summary: spark.local.dir override for YARN cluster execution
Key: SPARK-53364
URL: https://issues.apache.org/jira/browse/SPARK-53364
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0, 3.5.6
Reporter: Antony
h3. Context

I have a Hadoop cluster where each node has the following disk layout:
* One OS disk (200 GB)
* Two SSD/NVMe disks (1-2 TB each)
* Eight HDDs (16 TB each)

In Hadoop/YARN, yarn.nodemanager.local-dirs is a single setting: I cannot configure different locations for the file cache and the application cache.

h3. Problem

When I start a Python job, I want the Python environment filecache on the SSDs. When I start a very large SQL job, I want its temporary data on the HDDs. I have looked at the spark.local.dir property, but in YARN mode it is ignored in favor of the YARN-provided directories, so it does not solve the YARN-level configuration issue.

h3. Proposed resolution

I want to be able to configure Spark to use a specific directory, instead of always using the directories provided by the YARN NodeManager. The relevant logic is here:

{code:java}
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
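For illustration only, the requested precedence change could be sketched as a small standalone function. This is not the actual Spark code and LocalDirsSketch/resolveLocalDirs are hypothetical names; the idea is simply that an explicitly set spark.local.dir would win even inside a YARN container, with the YARN-provided directories as the fallback:

{code:java}
import java.util.Arrays;

public class LocalDirsSketch {

    /**
     * Sketch of the proposed precedence (hypothetical, not Spark's API).
     *
     * @param yarnDirs      comma-separated dirs from the NodeManager (LOCAL_DIRS),
     *                      or null when not running in a YARN container
     * @param sparkLocalDir the value of spark.local.dir, or null if unset
     */
    static String[] resolveLocalDirs(String yarnDirs, String sparkLocalDir) {
        if (sparkLocalDir != null) {
            // Proposed behavior: an explicit spark.local.dir overrides the YARN
            // directories, so SSD- or HDD-backed dirs can be chosen per job.
            return sparkLocalDir.split(",");
        }
        if (yarnDirs != null) {
            // Current behavior: use whatever YARN allocated for this container.
            return yarnDirs.split(",");
        }
        // Last resort, mirroring the existing fallback to java.io.tmpdir.
        return new String[] { System.getProperty("java.io.tmpdir") };
    }

    public static void main(String[] args) {
        // Explicit SSD dirs win over the YARN-provided HDD dirs.
        System.out.println(Arrays.toString(resolveLocalDirs("/hdd1,/hdd2", "/ssd1,/ssd2")));
        // No override set: fall back to the YARN-provided dirs.
        System.out.println(Arrays.toString(resolveLocalDirs("/hdd1,/hdd2", null)));
    }
}
{code}

A per-job setting like this would let the Python job above point its local dirs at the SSDs while the large SQL job points at the HDDs, without changing yarn.nodemanager.local-dirs cluster-wide.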