AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353
##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
logInfo(s"Added file $path at $key with timestamp $timestamp")
// Fetch the file locally so that closures which are run on the driver can still use the
// SparkFiles API to access files.
- Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+ Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
Review Comment:
Executor log when `updateDependencies` runs:
```
23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
```
On the executor side, `useCache = true` is passed when not in local mode, so
the executor fetches the file into the cache and then copies the cached file
to the root dir under its file name.
On the SparkContext driver side, the current code passes `useCache=false` and
only fetches the file to a temp file:
```
23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
```
So the added file won't exist under the root dir with its file name.
The code of `Utils.fetchFile()` is shown below:
<img width="1110" alt="Screenshot 2023-11-22 10.21.58 AM"
src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">
It's clear that an executor in local mode should pass `useCache=false`, since
in local mode it should use the file already fetched by the SparkContext.
But with the current code, the SparkContext doesn't add the file under its
file name.
So I think it should work like this:
1. `SparkContext.addFile()` should also copy the file to the root dir under
its file name, so that the driver side can also get the file by name and run
local tasks in the driver.
2. A non-local-mode executor will still update its dependencies and work
well.
3. A local-mode executor is started in the driver process, so it can use the
file downloaded by `SparkContext.addFile()`.
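
To make the cached flow from the executor log concrete, here is a minimal sketch of "fetch into the cache once, then copy to the root dir under the file name". It is a simplified model, not Spark's `Utils.fetchFile()`: the names `AddFileSketch`, `fetchWithCache`, and `download` are hypothetical, the "remote" source is just a local file, and locking and overwrite checks are left out. The cache name only imitates the `<hash><timestamp>_cache` pattern visible in the log above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AddFileSketch {
  // Hypothetical stand-in for the remote download; here it just copies a local file.
  private def download(source: Path, dest: Path): Unit = {
    Files.copy(source, dest, StandardCopyOption.REPLACE_EXISTING)
    ()
  }

  /** Fetch `source` into `cacheDir` once, then copy it to `rootDir` under its own file name. */
  def fetchWithCache(source: Path, rootDir: Path, cacheDir: Path, timestamp: Long): Path = {
    // Cache name modeled on the "<hash><timestamp>_cache" pattern seen in the executor log.
    val cached = cacheDir.resolve(s"${source.toString.hashCode}${timestamp}_cache")
    if (!Files.exists(cached)) {
      // Download to a temp file first, then move it into place in the cache.
      val tmp = Files.createTempFile(cacheDir, "fetchFileTemp", ".tmp")
      download(source, tmp)
      Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING)
    }
    // Copy the cached file to the root dir so it can be found by its file name.
    val target = rootDir.resolve(source.getFileName)
    Files.copy(cached, target, StandardCopyOption.REPLACE_EXISTING)
    target
  }

  def main(args: Array[String]): Unit = {
    val work = Files.createTempDirectory("add-file-sketch")
    val remote = Files.createDirectories(work.resolve("remote"))
    val root = Files.createDirectories(work.resolve("root"))
    val cache = Files.createDirectories(work.resolve("cache"))
    val source = Files.write(
      remote.resolve("feature_map.txt"), "a=1\n".getBytes(StandardCharsets.UTF_8))

    val target = fetchWithCache(source, root, cache, 1700559593903L)
    println(s"exists under root by name: ${Files.exists(target)}")
  }
}
```

In this model, a local-mode task running in the same process can resolve `feature_map.txt` from the root dir after the driver-side fetch, which is the behavior points 1 and 3 ask for.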