Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/3670#issuecomment-68810961
  
    I think it's fine to add this - it would be good to have a more fully 
fleshed version including tests, etc. To answer your questions:
    
    ```
    Should we allow specifying local dirs as well? The best way to do this 
would probably be to archive them. The drawback is that it would require a fair 
bit of code that I don't know of any current use cases for.
    ```
    IMO - we can punt on adding local directories and add it later if desired. 
For now, we should throw an exception if a `file://` or URI with no scheme (and 
there is no default FS) is passed.
    
    ```
    The addFiles implementation has a caching component that I don't entirely 
understand. What events are we caching between? AFAICT it's users calling 
addFile on the same file in the same app at different times? Do we want/need to 
add something similar for addDirectory. The addFiles implementation will check 
to see if an added file already exists and has the same contents. I imagine we 
want the same behavior, so planning to add this unless people think otherwise.
    ```
    This only answers part of your question, but we need to decide what the 
semantics of calling `addDirectory` are if the contents changed. I think that 
for `addFile` we support this... i.e. you can add a file then change its 
contents then re-add it and we make sure to invalidate stale copies on the 
executors. We detect this by looking if the timestamp changed. This seems hard 
to do for directories. To do it properly we'd need to check if any of the files 
changed and also do some type of diff to efficiently ship only the new files. 
So maybe we should just fail-fast and throw an exception if addDirectory is 
called twice and the underlying directory, or any of its recursive contents, 
have mutated (@joshrosen - what do you think of that)?
    
    Popping up a bit, I wonder if the user API should still be called `addFile` 
and we should just add a flag `recursive` that will allow you to add 
directories. And we'd just say in the doc that the semantics are more 
restrictive for directories in that (a) it needs to be in a globally visible 
filesystem and (b) you cannot add the same directory twice - i.e. we make an 
immutable copy the first time it is added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to