GitHub user srowen commented on the pull request:
https://github.com/apache/spark/pull/5886#issuecomment-99329245
@vanzin yes, the question is really what can happen inside `listStatus`.
It's not a question of being correct or trusted, @liyezhang556520, but just of
what can happen concurrently:
- `listStatus` begins at 99
- It finds the old last modified time at 100
- The file is modified at 101
- The call returns at 102 with a last modified time earlier than 101
This could be my ignorance, but is this metadata update atomic in the
NameNode? Meaning, as far as HDFS is concerned, could the file not have been
modified at 101, since the NameNode was busy with `listStatus`, so that the
modification is recorded at some time >= 102?
... but this boxes the issue into a really small corner. What happens in
this case? I realize that some log file doesn't get processed until a bit
later, but does the subsequent processing then go wrong? If the application
state isn't corrupted or wrong afterwards, I think this isn't worth addressing.
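
To make the timeline above concrete, here is a minimal Python simulation of a poller that filters files by last modified time. It is only a sketch under stated assumptions: `FakeFile`, `is_new`, and `run` are hypothetical names, the times are the logical ticks from the list above, and nothing here is Spark or HDFS code. It compares two possible ways of choosing the next filter threshold and does not establish which one, if either, matches the code under review.

```python
# Hypothetical simulation of the 99..102 timeline; not Spark or HDFS code.
from dataclasses import dataclass


@dataclass
class FakeFile:
    mtime: int  # logical last modified time


def is_new(seen_mtime, threshold):
    """A listing entry counts as new when its mtime is past the threshold."""
    return seen_mtime > threshold


def run(threshold_from_observed_mtime):
    log = FakeFile(mtime=100)
    threshold = 0
    processed = []

    # Scan 1: the call starts at 99 and reads the metadata at 100.
    seen = log.mtime      # 100
    log.mtime = 101       # file modified while the call is in flight
    returned_at = 102     # call returns, still reporting mtime 100
    if is_new(seen, threshold):
        processed.append(seen)
    # Assumption: the poller remembers either the newest mtime it observed
    # or the time at which the call returned.
    threshold = seen if threshold_from_observed_mtime else returned_at

    # Scan 2: a later poll reads the updated mtime of 101.
    seen = log.mtime
    if is_new(seen, threshold):
        processed.append(seen)
    return processed


print(run(True))   # threshold taken from the observed mtime
print(run(False))  # threshold taken from the call's return time
```

In this toy model, a threshold taken from the observed mtime only delays the update at 101 until the next scan, while a threshold taken from the return time (102) skips it. That is the distinction the question about later processing turns on.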