Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3670#issuecomment-72579535
  
    To briefly address the questions @sryza raised in the PR description:
    
    > Should we allow specifying local dirs as well? The best way to do this 
would probably be to archive them. The drawback is that it would require a fair 
bit of code that I don't know of any current use cases for.
    
    The user-facing API doesn't need to change in order to support this, so we 
could always choose to add this functionality later if we decide that there's a 
use-case for it.
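
For reference, the archiving approach mentioned in the question could look roughly like the sketch below. It is only an illustration in plain Python, not Spark code: `pack_directory` and `unpack_directory` are hypothetical helper names, and the zip format is an assumption.

```python
import os
import shutil


def pack_directory(src_dir, staging_dir):
    """Zip the contents of src_dir into staging_dir and return the archive path."""
    # Name the archive after the directory being shipped (hypothetical convention).
    base = os.path.join(staging_dir, os.path.basename(os.path.normpath(src_dir)))
    return shutil.make_archive(base, "zip", root_dir=src_dir)


def unpack_directory(archive_path, dest_dir):
    """Extract an archive produced by pack_directory into dest_dir."""
    shutil.unpack_archive(archive_path, dest_dir)
    return dest_dir
```

The caller would still pass a directory path, and the packing and unpacking would happen behind the existing call, which is why the user-facing API could stay the same.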
    
    > The addFiles implementation has a caching component that I don't entirely 
understand. What events are we caching between? AFAICT it's users calling 
addFile on the same file in the same app at different times? Do we want/need to 
add something similar for addDirectory?
    
    There are a few different layers of caching here.  In all Spark versions, 
individual executors cache added files, so files will only be downloaded by the 
first task that requires them.  In Spark 1.2.0+, we added a 
per-host-per-application cache that allows executors for the same application 
running on the same host to share downloaded files; this is useful in scenarios 
where you have a huge number of executors per host (see #1616 for more 
details).  `Utils.fetchFile` handles the lock file management to ensure that we 
only download files once per host, so I think we should already get that for 
free when downloading directories.
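
To make the per-host idea concrete, here is a minimal sketch of a lock-file-guarded download cache. It is not the actual `Utils.fetchFile` code: `fetch_once`, its parameters, and the `.lock`/`.tmp` naming are all hypothetical, and it uses `fcntl.flock`, so it assumes a Unix-like host.

```python
import fcntl
import os


def fetch_once(name, cache_dir, download):
    """Return the cached path for `name`, calling download(dest) only if it is not cached yet."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, name)
    lock_path = cached + ".lock"
    with open(lock_path, "w") as lock_file:
        # Block until no other process on this host holds the lock for this entry.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(cached):
                tmp = cached + ".tmp"
                download(tmp)
                # Publish the finished download under its final name in one step.
                os.replace(tmp, cached)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return cached
```

Every process that shares `cache_dir` ends up with the same path, and only the first one to take the lock pays for the download. The same pattern applies whether `download` writes a single file or a directory.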
    
    > The addFiles implementation will check to see if an added file already 
exists and has the same contents. I imagine we want the same behavior, so 
planning to add this unless people think otherwise.
    
    This seems fine, although I guess it could be a lot more expensive to 
perform this check for a directory than for a single file.
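
As a rough illustration of where the extra cost comes from, a content check for a directory has to read every file under it. The sketch below is an assumption about how such a check could be written, not what Spark does; `file_digest` and `dir_digest` are hypothetical names, and SHA-256 is an arbitrary choice.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash one file's contents, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_digest(root):
    """Hash the relative path and contents of every file under root."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so os.walk visits subdirectories in a stable order
        for fname in sorted(filenames):
            full = os.path.join(dirpath, fname)
            rel = os.path.relpath(full, root)
            h.update(rel.encode("utf-8"))
            h.update(b"\0")  # separator between the path and the content digest
            h.update(file_digest(full).encode("ascii"))
    return h.hexdigest()
```

Two directories with the same files and contents produce the same digest, but the work grows with the total size of the tree rather than with one file. Note that this sketch ignores empty subdirectories.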


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
