Koji and Roman,

Sorry to jump in late; I meant to follow up last week.

I created NIFI-3332 because of issue #2: when ListFile fires while the OS is
between writes of a batch of files, files with the same timestamp that the OS
writes after the processor fired are missed. I suspect #1 is an amplification
of #2, where second-resolution timestamps will unfortunately increase both the
potential collision rate and the potential state to be tracked by 1,000-fold
each.

I have a harder time with #3. I understand the opinion that a file is new if I
just wrote it, even if I kept the old timestamp. But NiFi has to use a discrete
means to identify new files, and I think it is reasonable to use file
timestamps, especially since this scenario can be mitigated by updating the
file timestamp. It could be possible to use a combination of modification and
creation times (where both are available) to minimize potential misses, but I
don't think #3 is as likely as #1 and #2 once the logic is understood,
especially since a workaround is fairly easy.

I think a ListXXX processor that tracks events from Linux inotify and/or
Windows FileSystemWatcher (or something similar) would be a great addition,
but the simplicity of ListFile would still be useful if I could trust it not
to silently drop files.

I hope that helps.

Regards,
Joe

On Wed, Jun 14, 2017 at 5:00 AM, Koji Kawamura <[email protected]> wrote:

> Hi Roman and all,
>
> As I investigated the ListFile processor further, I found those are two
> different issues.
> I also found another JIRA related to ListFile. Currently there seem to
> be three issues:
>
> 1. ListFile can miss files on filesystems that do not provide
> timestamps in millisecond precision (NIFI-4096)
> 2. ListFile can miss files having the same timestamp as the
> previously processed latest timestamp (NIFI-3332)
> 3. ListFile cannot pick up files whose timestamp is older than the
> previously processed latest timestamp (NIFI-2383)
>
> # NIFI-4096
> I created JIRA NIFI-4096 to address issue #1 above, by adding
> deterministic logic to detect the target filesystem's timestamp precision.
> With NIFI-4096, ListFile can list all 10,000 files created by the
> command you shared before without missing anything:
>
> ```
> for i in {1..10000}; do touch ./test_$i; done
> ```
>
> The PR is ready for review. I would appreciate it if you could test the
> fix with your use case.
>
> Additionally, I refactored variable names in AbstractListProcessor to
> explain their purpose and timestamp unit better. I hope it makes the code
> more readable and maintainable.
>
> # NIFI-3332
> I'm thinking about adding a processor property to specify whether to
> track the listed filenames with the latest processed timestamp.
> Although it will be less efficient, it'd be good for some use cases.
>
> # NIFI-2383
> This is the most difficult case to handle correctly with only a timestamp.
> We need a different processor that can use a watch API.
>
> Any comment would be appreciated.
>
> Thanks,
> Koji
>
> On Tue, Jun 6, 2017 at 9:18 PM, Koji Kawamura <[email protected]>
> wrote:
> > Hi Roman,
> >
> > I think NIFI-3332 is probably related, as I can see the timestamps in
> > the logs don't have milliseconds.
> >
> > I've been considering how we can support all corner cases with minimal
> > state to persist, and make it work even if the filesystem only
> > provides the last modified timestamp in seconds precision.
> > I'm changing code and testing locally, but I'm not ready to send a PR
> > yet, and I am not fully confident about how to fix it.
> >
> > Any suggestion or insight would be appreciated to make these ListXXXX
> > processors better.
> >
> > Thanks,
> > Koji
> >
> > On Tue, Jun 6, 2017 at 8:54 PM, Roman <[email protected]> wrote:
> >> Hi there,
> >>
> >> While digging into this issue, I found an open issue in JIRA, NIFI-3332
> >> <https://issues.apache.org/jira/browse/NIFI-3332>. Could it be related
> >> to my situation with the missing milliseconds?
> >>
> >> Thanks
> >> Roman
> >>
> >>
> >> Koji Kawamura-2 wrote
> >>> Hello Roman,
> >>>
> >>> It seems the resolution of the last modified timestamp depends on the
> >>> file system implementation.
> >>> https://stackoverflow.com/questions/3805201/how-to-get-ubuntu-file-timestamp-in-millisecond
> >>>
> >>> I reproduced the same behavior on OS X, which uses HFS that has the
> >>> same limitation of resolution in seconds.
> >>> https://stackoverflow.com/questions/18403588/how-to-return-millisecond-information-for-file-access-on-mac-os-x-in-java
> >>>
> >>> Which file system are you using on your Ubuntu? If it is ext3, then
> >>> changing it to ext4 may address the issue.
> >>>
> >>> Thanks,
> >>> Koji
> >>>
> >>> On Thu, Jun 1, 2017 at 1:25 AM, Roman <ramon9869@> wrote:
> >>>> Hi there, I need help.
> >>>>
> >>>> We are preparing a high-load project and tested these processors. We
> >>>> always see the listing.timestamp and processed.timestamp keys without
> >>>> milliseconds (xxxxxxxxxx000). As a result, if several files are
> >>>> generated within one second, not all files will be listed.
> >>>>
> >>>>
> >>>> Test:
> >>>> 1. Start the ListFile/ListSFTP processor
> >>>> 2. Generate 10000 zero-size files. My command:
> >>>> for i in {1..10000}; do touch ./test_$i; done
> >>>> 3. See processor stats: out 3952 (0 bytes)
> >>>>
> >>>>
> >>>> Am I doing something wrong? Or is it a bug in NiFi/Java/etc.?
> >>>>
> >>>> Environment
> >>>>
> >>>> Ubuntu 14.04.5 LTS, x64, ext4 file system
> >>>> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> >>>> Nifi 1.2.0 From 3a605af, Tagged nifi-1.2.0-RC2
> >>>>
> >>>>
> >>>> Thanks
> >>>> Roman
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> View this message in context:
> >>>> http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037.html
> >>>> Sent from the Apache NiFi Developer List mailing list archive at
> >>>> Nabble.com.
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037p16118.html
> >> Sent from the Apache NiFi Developer List mailing list archive at
> >> Nabble.com.
>
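
To make issue #2 (NIFI-3332) concrete, here is a minimal Python sketch of the
race and of the filename-tracking idea discussed above. It is not NiFi's code:
the class names, the list_new method, and the sample timestamps are all
hypothetical, and a directory scan is modelled as a plain dict of
filename -> last-modified time in milliseconds.

```python
from dataclasses import dataclass, field


@dataclass
class TimestampOnlyLister:
    """Hypothetical lister that remembers only the newest timestamp it emitted."""
    last_ts: int = -1

    def list_new(self, files: dict) -> list:
        # Only files strictly newer than the remembered timestamp count as new.
        new = sorted(name for name, ts in files.items() if ts > self.last_ts)
        if new:
            self.last_ts = max(files[name] for name in new)
        return new


@dataclass
class NameTrackingLister:
    """Hypothetical lister that also remembers the names seen at the newest timestamp."""
    last_ts: int = -1
    names_at_last_ts: set = field(default_factory=set)

    def list_new(self, files: dict) -> list:
        # A file is new if it is strictly newer, or if it shares the newest
        # timestamp but its name has not been emitted yet.
        new = sorted(
            name for name, ts in files.items()
            if ts > self.last_ts
            or (ts == self.last_ts and name not in self.names_at_last_ts)
        )
        if new:
            newest = max(files[name] for name in new)
            if newest > self.last_ts:
                # The newest timestamp moved forward, so the old names are obsolete.
                self.last_ts = newest
                self.names_at_last_ts = set()
            self.names_at_last_ts.update(
                name for name in new if files[name] == self.last_ts
            )
        return new


# The processor fires after "a" and "b" are written; "c" is written later
# with the same timestamp, followed by "d" with a newer one.
first_scan = {"a": 1000, "b": 1000}
second_scan = {"a": 1000, "b": 1000, "c": 1000, "d": 2000}

plain = TimestampOnlyLister()
print(plain.list_new(first_scan))     # ['a', 'b']
print(plain.list_new(second_scan))    # ['d']  ("c" is silently missed)

tracked = NameTrackingLister()
print(tracked.list_new(first_scan))   # ['a', 'b']
print(tracked.list_new(second_scan))  # ['c', 'd']
```

The extra state is the set of names sharing the newest timestamp. With
second-resolution timestamps, more files fall into that set, which is the
state growth mentioned for issue #1.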
