AngersZhuuuu commented on code in PR #43936: URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353
##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
Executor log during `updateDependencies`:
```
23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
```
On the executor side, `useCache = true` is passed when not in local mode, so the executor fetches the file into the cache and then copies the cached file to the root dir under its file name.

On the SparkContext driver side, the current code passes `useCache = false`, so it only fetches the file as a temp file:
```
23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
```
So the added file won't exist under the root dir with its file name.
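For illustration, the cache-then-copy sequence visible in the executor log can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual `Utils.fetchFile` code: `CachedFetchSketch`, `download`, `fetchWithCache`, and the `cacheKey` parameter are assumed names, and the remote fetch is replaced by a local write.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical, simplified model of the cache-then-copy sequence seen in the executor log.
// This is NOT Spark's Utils.fetchFile; every name below is illustrative.
object CachedFetchSketch {

  // Stand-in for the remote download: writes `content` to a temp file inside `dir`.
  def download(content: String, dir: Path): Path = {
    val tmp = Files.createTempFile(dir, "fetchFileTemp", ".tmp")
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
    tmp
  }

  // Fetch once into `<cacheDir>/<cacheKey>_cache`, then copy to `<targetDir>/<fileName>`.
  def fetchWithCache(
      content: String,
      cacheDir: Path,
      cacheKey: String,
      targetDir: Path,
      fileName: String): Path = {
    val cached = cacheDir.resolve(s"${cacheKey}_cache")
    if (!Files.exists(cached)) {
      // First fetch: download to a temp file, then move it into the cache slot.
      val tmp = download(content, cacheDir)
      Files.move(tmp, cached, StandardCopyOption.ATOMIC_MOVE)
    }
    // Every caller gets its own copy under the real file name.
    val target = targetDir.resolve(fileName)
    Files.copy(cached, target, StandardCopyOption.REPLACE_EXISTING)
    target
  }

  def main(args: Array[String]): Unit = {
    val cacheDir = Files.createTempDirectory("spark-local")
    val containerDir = Files.createTempDirectory("container")
    val out = fetchWithCache("a=1\n", cacheDir, "key", containerDir, "feature_map.txt")
    println(out.getFileName)
  }
}
```

The point of the sketch is only that the final step leaves a file named after the original (`feature_map.txt` here) in the target dir, which is what tasks look up by name.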
The code of `Utils.fetchFile()` is shown below:

<img width="1110" alt="Screenshot 2023-11-22 10.21.58 AM" src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">

It's clear that an executor in local mode should pass `useCache = false`, since in local mode it should use the file already fetched by the SparkContext. But with the current code, the SparkContext won't add this file under its file name. So I think the behavior should be:

1. `SC.addFile()` should also copy the file to the root dir under its file name, so that the driver side can also get the file by name and run local tasks in the driver.
2. A non-local-mode executor will also update its dependencies and work well.
3. A local-mode executor is started inside the driver process, so it can use the file downloaded by `SC.addFile()`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org