Antony created SPARK-53364:
------------------------------

Summary: spark.local.dir override for YARN cluster execution
Key: SPARK-53364
URL: https://issues.apache.org/jira/browse/SPARK-53364
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0, 3.5.6
Reporter: Antony
h3. Context

I have a Hadoop cluster where each node has the following disk layout:
* One OS disk (200 GB)
* Two SSD/NVMe disks (1-2 TB each)
* Eight HDDs (16 TB each)

In Hadoop/YARN, yarn.nodemanager.local-dirs is a single setting: I cannot configure different locations for the file cache and the application cache.

h3. Problem

When I start a Python job, I want the Python environment filecache on the SSDs. When I start a very large SQL job, I want its temporary data on the HDDs. I have looked at the spark.local.dir property, but in YARN mode it is ignored in favor of the YARN-provided directories, so it does not solve the YARN-level configuration issue.

h3. Proposed resolution

I want to be able to configure Spark to use a specific directory, instead of always using the directories provided by the YARN NodeManager. The relevant logic is here:

{code:java}
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
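For illustration only, the requested precedence change could be sketched as a small standalone function. This is not the actual Spark code and LocalDirsSketch/resolveLocalDirs are hypothetical names; the idea is simply that an explicitly set spark.local.dir would win even inside a YARN container, with the YARN-provided directories as the fallback:

{code:java}
import java.util.Arrays;

public class LocalDirsSketch {

    /**
     * Sketch of the proposed precedence (hypothetical, not Spark's API).
     *
     * @param yarnDirs      comma-separated dirs from the NodeManager (LOCAL_DIRS),
     *                      or null when not running in a YARN container
     * @param sparkLocalDir the value of spark.local.dir, or null if unset
     */
    static String[] resolveLocalDirs(String yarnDirs, String sparkLocalDir) {
        if (sparkLocalDir != null) {
            // Proposed behavior: an explicit spark.local.dir overrides the YARN
            // directories, so SSD- or HDD-backed dirs can be chosen per job.
            return sparkLocalDir.split(",");
        }
        if (yarnDirs != null) {
            // Current behavior: use whatever YARN allocated for this container.
            return yarnDirs.split(",");
        }
        // Last resort, mirroring the existing fallback to java.io.tmpdir.
        return new String[] { System.getProperty("java.io.tmpdir") };
    }

    public static void main(String[] args) {
        // Explicit SSD dirs win over the YARN-provided HDD dirs.
        System.out.println(Arrays.toString(resolveLocalDirs("/hdd1,/hdd2", "/ssd1,/ssd2")));
        // No override set: fall back to the YARN-provided dirs.
        System.out.println(Arrays.toString(resolveLocalDirs("/hdd1,/hdd2", null)));
    }
}
{code}

A per-job setting like this would let the Python job above point its local dirs at the SSDs while the large SQL job points at the HDDs, without changing yarn.nodemanager.local-dirs cluster-wide.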