GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3670#issuecomment-68975962
A bit of archaeology reveals that the original motivation for worrying about
overwriting files was that in local mode files added through `addFile` would be
downloaded to the current working directory, so we wanted to prevent users from
accidentally deleting the original copies of files by downloading files with
the same names: https://github.com/mesos/spark/pull/345. This was before we
had the `SparkFiles` API, so at the time we assumed that user code would look
in the CWD for added files and therefore didn't have an alternative to placing
files in the local directory. Now that we have SparkFiles, though, I think we
can remove this overwrite protection logic since it was originally guarding
against Spark destroying users' source files, not against user behavior /
errors.
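For reference, here is a minimal sketch of that workflow (the file path and app name below are made up); tasks resolve added files through `SparkFiles.get` rather than looking in the CWD:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Driver side: register a file to be shipped to every executor.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("spark-files-sketch"))
sc.addFile("/tmp/lookup.txt")  // path is just an example

// Task side: resolve the executor-local copy via SparkFiles rather than
// assuming the file landed in the task's current working directory.
sc.parallelize(1 to 4).foreach { _ =>
  val localPath = SparkFiles.get("lookup.txt")
  println(s"added file is at: $localPath")
}
```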
It looks like we added the `spark.files.overwrite` setting to explicitly
allow files to be overwritten with different contents (e.g. refreshing a file
across all executors): fd833e7ab1bf006c5ae1dff4767b02729e1bbfa7.
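For anyone following along, opting in looks roughly like this (the app name is just a placeholder):

```scala
import org.apache.spark.SparkConf

// With spark.files.overwrite=true, re-adding a file whose contents have
// changed replaces the executors' copies instead of failing the fetch.
val conf = new SparkConf()
  .setAppName("overwrite-sketch")
  .set("spark.files.overwrite", "true")
```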
I guess the behavior for directories might be a bit different since you
might want to also account for deletions (e.g. if I delete a file from the
directory and then re-add the directory, that file should probably be deleted
on the executors as well).
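To make that concrete, the executor-side handling would need something along these lines (a rough sketch, not Spark code; the helper name is invented and it only handles the top level of the directory for brevity):

```scala
import java.io.File

// Hypothetical helper: after fetching a re-added directory, drop any files in
// the executor's existing copy that no longer exist upstream, so deletions
// propagate along with additions and overwrites.
def pruneDeleted(fetchedDir: File, localDir: File): Unit = {
  val stillPresent = Option(fetchedDir.list()).getOrElse(Array.empty[String]).toSet
  Option(localDir.listFiles()).getOrElse(Array.empty[File])
    .filterNot(f => stillPresent.contains(f.getName))
    .foreach(_.delete())
}
```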
RE: the recursive flag, I guess the idea here is something like `cp` vs.
`cp -r` (one method with a recursive flag) as opposed to separate `cpFile`
and `cpDir` methods?
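In other words (the signatures below are purely illustrative, just to show the contrast):

```scala
// Option A: one method with a recursive flag, like `cp` / `cp -r`.
def addFile(path: String, recursive: Boolean = false): Unit = ???

// Option B: two separate methods, like the `cpFile` / `cpDir` split.
def cpFile(path: String): Unit = ???
def cpDir(path: String): Unit = ???
```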