Github user liyezhang556520 commented on the pull request:
https://github.com/apache/spark/pull/5886#issuecomment-99295131
@vanzin, there is a time interval between reading the first file's
modification time and reading the last file's. Assume there are 3 files: F1,
F2, F3. Before scanning, their modification times are TF1=100, TF2=101,
TF3=102 respectively.
At time T1=103, we start scanning.
At time T2=104, we finish loading F1's mod time and start loading F2's mod
time.
At time T3=107, we finish loading F2's mod time. At this point,
`lastModifiedTime` is 101, which is equal to F2's mod time, TF2. While F2's
mod time was being loaded, two writes happened:
First, at time T4=105, contents are written to F1, which changes F1's mod
time from TF1=100 to TF1'=105.
Second, at time T5=106, contents are written to F3, which changes F3's mod
time from TF3=102 to TF3'=106.
Then we continue to load F3's mod time, and at time T6=108, we finish
loading it. At this point, `lastModifiedTime` is 106.
So in the next round, we would not pick up F1 even though it was modified at
time T4=105, because 105 is less than the new `lastModifiedTime` of 106.
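
The timeline above can be replayed with a small simulation. This is only a sketch under stated assumptions, not Spark's actual scanning code: `WRITE_TIMES`, `mod_time_at`, and `scan` are hypothetical names, the initial threshold of 0 is made up, and the `>=` filter is an assumed stand-in for however the next round compares against `lastModifiedTime`.

```python
# Hypothetical replay of the timeline; all names and the >= filter are assumptions.

# Instants at which each file is written; each write sets the mod time to that instant.
WRITE_TIMES = {"F1": [100, 105], "F2": [101], "F3": [102, 106]}


def mod_time_at(name, now):
    """Mod time of `name` as seen at time `now`: the latest write not after `now`."""
    return max(t for t in WRITE_TIMES[name] if t <= now)


def scan(read_times, last_modified_time):
    """Read each file's mod time at its own instant, as a non-atomic scan would.

    Returns the files picked up (mod time >= last_modified_time), the new
    last_modified_time (largest mod time observed so far), and the observed
    mod times.
    """
    observed = {name: mod_time_at(name, t) for name, t in read_times.items()}
    picked = sorted(n for n, m in observed.items() if m >= last_modified_time)
    new_last = max([last_modified_time, *observed.values()])
    return picked, new_last, observed


# Round 1: F1 is read at T2=104, F2 at T3=107, F3 at T6=108.
first_picked, last_modified_time, first_observed = scan(
    {"F1": 104, "F2": 107, "F3": 108}, 0
)

# Round 2: some later scan; the exact read times no longer matter.
second_picked, _, second_observed = scan(
    {"F1": 110, "F2": 111, "F3": 112}, last_modified_time
)

print("round 1 observed:", first_observed, "lastModifiedTime:", last_modified_time)
print("round 2 observed:", second_observed, "picked:", second_picked)
```

In round 1, F1 is observed with its old mod time because its write at T4=105 lands after it was read, while F3's write at T5=106 pushes the threshold past 105. Round 2 then sees F1 at 105 but filters it out.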