GitHub user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    @zsxwing, I meant to describe what happens on executors in the following 
scenario:
    
    - `addFile(foo)` is called for the first time at `timestamp = 1`
    - A task runs on an executor and downloads the copy of the file added at 
`timestamp = 1`.
      - By default, the [fetch file cache is 
enabled](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L432)
 and filenames in that cache incorporate timestamps. Thus, this file will be 
downloaded to a file named `<hashcode of foo's URL><timestamp>_cache`.
    - `addFile(foo)` is called a second time at `timestamp = 2` and the same 
file is passed to it.
    - A task runs on an executor and discovers that the added file's timestamp 
(2) is newer than the timestamp of the file that it has already downloaded (1), 
so it tries to fetch files again:
      - Because the file with the newer timestamp is not present in the fetch 
file cache, a new copy of the file will be downloaded. **<--- this is the 
second download I was referring to**
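
    The scenario above can be sketched as a toy model. This is only an illustration, not Spark's actual code: `FetchCacheSketch`, `fetch`, and the `downloads` counter are hypothetical names, and the cache is simulated with an in-memory set rather than files on disk. The only detail taken from the description is that the cache key combines the URL's hash code with the timestamp.

```scala
import scala.collection.mutable

// Hypothetical toy model of a timestamp-keyed fetch cache (not Spark's code).
object FetchCacheSketch {
  // Names of the files already present in the simulated cache directory.
  private val cached = mutable.Set.empty[String]

  // Number of simulated downloads performed so far.
  var downloads = 0

  // The cache key combines the URL's hash code with the addFile timestamp,
  // so the same URL with a different timestamp maps to a different name.
  def cachedFileName(url: String, timestamp: Long): String =
    s"${url.hashCode}${timestamp}_cache"

  // Returns true if a download was needed, false on a cache hit.
  def fetch(url: String, timestamp: Long): Boolean = {
    val name = cachedFileName(url, timestamp)
    if (cached.contains(name)) {
      false
    } else {
      cached += name
      downloads += 1
      true
    }
  }
}
```

    In this model, fetching the same URL at `timestamp = 1` and then at `timestamp = 2` counts two downloads, even though nothing in the model says the contents differ.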
    
    If the fetch file cache is disabled, on the other hand, then we directly 
call 
[`doFetchFile`](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L617)
 which, in turn, will call `downloadFile()`, which [downloads the file to a 
temporary 
file](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L499)
 before considering whether to overwrite an existing file.
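
    A simplified sketch of that "download to a temporary file, then decide" pattern, under stated assumptions: `TempDownloadSketch` is a hypothetical name, the byte array stands in for the remote stream, and the exception type and the handling of an existing file are illustrative rather than copied from `Utils.scala`. The point is that the transfer cost is paid before the comparison with the existing file happens.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.Arrays

// Hypothetical sketch (not Spark's code): always download to a temp file
// first, and only afterwards compare against any existing destination.
object TempDownloadSketch {
  // `bytes` stands in for the remote contents.
  // Returns true if `dest` was created or replaced, false if it was kept.
  def downloadFile(bytes: Array[Byte], dest: Path, overwrite: Boolean): Boolean = {
    val dir = dest.toAbsolutePath.getParent
    val tmp = Files.createTempFile(dir, "fetchFileTemp", ".tmp")
    try {
      // The "download" happens here, unconditionally.
      Files.write(tmp, bytes)
      val exists = Files.exists(dest)
      if (exists && Arrays.equals(Files.readAllBytes(dest), bytes)) {
        // Identical contents: keep the existing file; the download was wasted.
        false
      } else if (exists && !overwrite) {
        throw new IllegalStateException(s"$dest exists with different contents")
      } else {
        Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING)
        true
      }
    } finally {
      // Clean up the temp file if it was not moved into place.
      Files.deleteIfExists(tmp)
    }
  }
}
```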
    
    In either case, it looks like re-adding a file with a new timestamp will 
trigger downloads on the executors, and those downloads are unnecessary if 
the file's contents are unchanged.

