GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5886#issuecomment-99329245
  
    @vanzin yes, the question is really what can happen inside `listStatus`. 
It's not a question of being correct or trusted, @liyezhang556520, but just what 
can happen concurrently.
    
    - `listStatus` begins at 99
    - Finds old last modified time at 100
    - File modified at 101
    - Call returns at 102 with last modified time earlier than 101
    
    This could be my ignorance, but is this metadata update atomic in the 
NameNode? Meaning, as far as HDFS is concerned, could the file not have been 
modified at 101, since the NameNode was busy with `listStatus`, so that the 
modification is recorded at some time >= 102?
    
    ... but this boxes the issue into a really small corner. What happens in 
this case? I realize that some log file then doesn't get processed until a bit 
later, but does the subsequent processing go wrong? If the application state 
isn't corrupted or wrong afterwards, I think this isn't worth addressing.

