GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3670#issuecomment-68975962
A bit of archaeology reveals that the original motivation for worrying about
overwriting files was that in local mode files added through `addFile` would be
downloaded to the current working directory, so we wanted to prevent users from
accidentally deleting the original copies of files by downloading files with
the same names: https://github.com/mesos/spark/pull/345. This was before we
had the `SparkFiles` API, so at the time we assumed that user code would look
in the CWD for added files and therefore didn't have an alternative to placing
files in the local directory. Now that we have SparkFiles, though, I think we
can remove this overwrite protection logic since it was originally guarding
against Spark destroying users' source files, not against user behavior /
errors.
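For reference, here is a minimal sketch of that workflow (the file path and app name below are made up); tasks resolve added files through `SparkFiles.get` rather than looking in the CWD:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Driver side: register a file to be shipped to every executor.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("spark-files-sketch"))
sc.addFile("/tmp/lookup.txt")  // path is just an example

// Task side: resolve the executor-local copy via SparkFiles rather than
// assuming the file landed in the task's current working directory.
sc.parallelize(1 to 4).foreach { _ =>
  val localPath = SparkFiles.get("lookup.txt")
  println(s"added file is at: $localPath")
}
```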
It looks like we added the `spark.files.overwrite` setting to explicitly
allow files to be overwritten with different contents (e.g. refreshing a file
across all executors): fd833e7ab1bf006c5ae1dff4767b02729e1bbfa7.
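For anyone following along, opting in looks roughly like this (the app name is just a placeholder):

```scala
import org.apache.spark.SparkConf

// With spark.files.overwrite=true, re-adding a file whose contents have
// changed replaces the executors' copies instead of failing the fetch.
val conf = new SparkConf()
  .setAppName("overwrite-sketch")
  .set("spark.files.overwrite", "true")
```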
I guess the behavior for directories might be a bit different since you
might want to also account for deletions (e.g. if I delete a file from the
directory and then re-add the directory, that file should probably be deleted
on the executors as well).
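To make that concrete, the executor-side handling would need something along these lines (a rough sketch, not Spark code; the helper name is invented and it only handles the top level of the directory for brevity):

```scala
import java.io.File

// Hypothetical helper: after fetching a re-added directory, drop any files in
// the executor's existing copy that no longer exist upstream, so deletions
// propagate along with additions and overwrites.
def pruneDeleted(fetchedDir: File, localDir: File): Unit = {
  val stillPresent = Option(fetchedDir.list()).getOrElse(Array.empty[String]).toSet
  Option(localDir.listFiles()).getOrElse(Array.empty[File])
    .filterNot(f => stillPresent.contains(f.getName))
    .foreach(_.delete())
}
```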
RE: the recursive flag, I guess the idea here is something like `cp` vs.
`cp -r` (one method with a recursive flag) as opposed to separate `cpFile`
and `cpDir` methods?
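In other words (the signatures below are purely illustrative, just to show the contrast):

```scala
// Option A: one method with a recursive flag, like `cp` / `cp -r`.
def addFile(path: String, recursive: Boolean = false): Unit = ???

// Option B: two separate methods, like the `cpFile` / `cpDir` split.
def cpFile(path: String): Unit = ???
def cpDir(path: String): Unit = ???
```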