GitHub user JoshRosen commented on the issue:
https://github.com/apache/spark/pull/14396
A few other notes:
- It's important that we keep the timestamps because they're necessary for
cross-worker / cross-executor JAR + file download caching to work correctly.
This is a relatively obscure feature which was added a few releases ago in
order to improve scalability when running large clusters with many executors
per worker. In theory, different applications will generally have different
timestamps for the same JAR, so we could just as well use app IDs for this, but
I'd rather not make such an invasive change now.
- If we assume that `addFile()` will never be called with files that have the
same name but different contents (so as not to brick executors), then allowing
files to be downloaded so that the content comparison can be performed on
executors will cause performance problems when `addFile()` is called repeatedly
with the same argument: executors will download the file again each time, even
though its contents are identical.
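
To make the two notes above concrete, here is a minimal sketch of timestamp-keyed caching. It is an illustration only, not Spark's actual code: the class `ExecutorFiles`, its `update` method, the `fetch` callback, and the dict standing in for a per-worker cache are all hypothetical names. It assumes the driver sends each executor a (file name, timestamp) pair and that executors on the same worker can share one cache.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a cache shared by all executors on one worker,
# keyed by (file name, timestamp).
SharedCache = Dict[Tuple[str, int], bytes]


class ExecutorFiles:
    """Hypothetical executor-side bookkeeping for added files (not Spark's API)."""

    def __init__(self, fetch: Callable[[str], bytes], shared_cache: SharedCache) -> None:
        self._fetch = fetch                 # assumed download function
        self._cache = shared_cache          # shared across executors on a worker
        self._seen: Dict[str, int] = {}     # file name -> newest timestamp handled

    def update(self, name: str, timestamp: int) -> bool:
        """Return True only if this call actually downloaded the file."""
        # Same or older timestamp: nothing to do, so repeated addFile-style
        # calls with the same argument never trigger another download.
        if self._seen.get(name, -1) >= timestamp:
            return False

        key = (name, timestamp)
        downloaded = False
        if key not in self._cache:
            # First executor on this worker to see this (name, timestamp).
            self._cache[key] = self._fetch(name)
            downloaded = True

        self._seen[name] = timestamp
        return downloaded
```

In this sketch the timestamp is what lets a second executor on the same worker reuse the first executor's download, and what lets a repeated call be skipped without comparing contents. Dropping the timestamp from the key would force either a re-download or a content comparison to tell versions apart.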