Koji and Roman,

Sorry to jump in here late; I meant to follow up last week.

I created NIFI-3332 because of issue #2: when ListFile fires while the OS
is partway through writing a batch of files, any files that the OS writes
after the processor fired, but with the same timestamp, are missed.  I
suspect #1 is an amplification of #2, where one-second resolution will
unfortunately increase both the potential collision rate and the potential
state to be tracked by 1,000-fold each.
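
To make the timing concrete, here is a minimal sketch of the scenario. It
is a simplified model written in Python for brevity, not the actual
AbstractListProcessor logic: it assumes a lister that keeps only a single
"latest timestamp" and picks up files strictly newer than it. The names
list_new and run and the precision parameter are made up for this
illustration.

```python
# Simplified model for illustration; NOT the actual NiFi implementation.

def list_new(files, last_ts):
    """Return (name, ts) entries strictly newer than last_ts and the new mark."""
    new = [(name, ts) for name, ts in files if ts > last_ts]
    newest = max((ts for _, ts in new), default=last_ts)
    return new, newest


def run(writes, fire_after, precision=1):
    """List once after `fire_after` writes, then again after all writes.

    `writes` holds (name, modified_millis); `precision` is the timestamp
    granularity in milliseconds (1 = millis, 1000 = whole seconds).
    """
    stamped = [(name, ts // precision * precision) for name, ts in writes]
    first, mark = list_new(stamped[:fire_after], last_ts=0)
    second, _ = list_new(stamped, last_ts=mark)
    return [name for name, _ in first + second]


# Files written 1 ms apart, and files written within the same millisecond.
spread = [("a", 5000), ("b", 5001), ("c", 5002), ("d", 5003)]
same_ms = [("a", 5000), ("b", 5000), ("c", 5000), ("d", 5000)]

spread_millis = run(spread, fire_after=2)                   # nothing missed
same_ms_millis = run(same_ms, fire_after=2)                 # issue #2: c, d missed
spread_seconds = run(spread, fire_after=2, precision=1000)  # issue #1: c, d missed

print(spread_millis, same_ms_millis, spread_seconds)
```

In this model, truncating to whole seconds turns files written 1 ms apart
into the same-timestamp case, which is why I see #1 as #2 with a much
wider collision window.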

I have a harder time with #3.  I understand the opinion that a file is new
if I just wrote it, even if I kept the old timestamp, but NiFi has to use
some discrete means to identify new files, and I think it is reasonable to
use file timestamps, especially since this scenario can be mitigated by
updating the file timestamp.  It could be possible to use a combination of
modification and creation times (where both are available) to minimize
potential misses, but I don't think #3 is as likely as #1 and #2 once the
logic is understood, especially since a workaround is fairly easy.
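
As a rough sketch of both ideas (again in Python, and not anything NiFi
does today): the hypothetical effective_timestamp helper takes the later of
the modification time and the creation time where the platform exposes one,
and touch shows the workaround of refreshing the modification time after a
copy that preserved the old one. st_birthtime is not available on every
platform, so the helper falls back to the modification time alone.

```python
import os
import tempfile


def effective_timestamp(path):
    """Later of mtime and creation time, when a creation time is exposed."""
    st = os.stat(path)
    birth = getattr(st, "st_birthtime", None)  # missing on some platforms
    return st.st_mtime if birth is None else max(st.st_mtime, birth)


def touch(path):
    """Workaround: set access and modification times to the current time."""
    os.utime(path, None)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "copied.dat")
    with open(path, "w") as handle:
        handle.write("payload")

    # Simulate a copy that preserved an old modification time.
    old = 1_000_000_000
    os.utime(path, (old, old))
    preserved_mtime = os.stat(path).st_mtime
    combined = effective_timestamp(path)

    # After touching, a timestamp-based lister would see the file as new.
    touch(path)
    touched_mtime = os.stat(path).st_mtime

print(preserved_mtime, combined, touched_mtime)
```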

I think a ListXXX processor that tracks events from Linux inotify and/or
Windows FileSystemWatcher (or something similar) services would be a great
addition, but the simplicity of ListFile would still be useful if I could
trust it to not silently drop files.

I hope that helps.

Regards,
Joe

On Wed, Jun 14, 2017 at 5:00 AM, Koji Kawamura <[email protected]>
wrote:

> Hi Roman and all,
>
> As I investigated the ListFile processor further, I found that those are
> two different issues.
> I also found another JIRA related to ListFile. Currently there seem to
> be three issues:
>
> 1. ListFile can miss files on filesystems that do not provide
> timestamps with millisecond precision (NIFI-4096)
> 2. ListFile can miss files that have the same timestamp as the
> previously processed latest timestamp (NIFI-3332)
> 3. ListFile cannot pick up files whose timestamp is older than the
> previously processed latest timestamp (NIFI-2383)
>
> # NIFI-4096
> I created JIRA NIFI-4096 to address issue #1 above, by adding
> deterministic logic to detect the target filesystem's timestamp precision.
> With NIFI-4096, ListFile can list all 10,000 files created by the
> command you shared before without missing any:
>
> ```
> for i in {1..10000}; do touch ./test_$i; done
> ```
>
> The PR is ready for review. I would appreciate it if you could test the
> fix with your use case.
>
> Additionally, I renamed variables in AbstractListProcessor to explain
> their purpose and timestamp unit better. I hope that makes the code
> more readable and maintainable.
>
> # NIFI-3332
> I'm thinking about adding a processor property to specify whether to
> track the listed filenames along with the latest processed timestamp.
> Although it would be less efficient, it would be good for some use cases.
>
> # NIFI-2383
> This is the most difficult case to handle correctly with only a timestamp.
> We need a different processor that can use a watch API.
>
> Any comment would be appreciated.
>
> Thanks,
> Koji
>
> On Tue, Jun 6, 2017 at 9:18 PM, Koji Kawamura <[email protected]>
> wrote:
> > Hi Roman,
> >
> > I think NIFI-3332 is probably related as I can see timestamps in logs
> > don't have milliseconds.
> >
> > I've been considering how we can support all corner cases with minimal
> > state to persist, and make it work even if the filesystem only
> > provides last-modified timestamps with one-second precision.
> > I'm changing code and testing locally, but I'm not ready to send a PR
> > yet, and I am not fully confident about how to fix it.
> >
> > Any suggestion or insight to make these ListXXXX processors better
> > would be appreciated.
> >
> > Thanks,
> > Koji
> >
> > On Tue, Jun 6, 2017 at 8:54 PM, Roman <[email protected]> wrote:
> >> Hi there,
> >>
> >> While digging into this issue, I found an open issue in JIRA, NIFI-3332
> >> <https://issues.apache.org/jira/browse/NIFI-3332>. Could it be related
> >> to my situation with the missing milliseconds?
> >>
> >> Thanks
> >> Roman
> >>
> >>
> >> Koji Kawamura-2 wrote
> >>> Hello Roman,
> >>>
> >>> It seems the resolution of last modified timestamp depends on the file
> >>> system implementation.
> >>> https://stackoverflow.com/questions/3805201/how-to-get-ubuntu-file-timestamp-in-millisecond
> >>>
> >>> I reproduced the same behavior on OS X, which uses HFS and has the
> >>> same limitation of one-second resolution.
> >>> https://stackoverflow.com/questions/18403588/how-to-return-millisecond-information-for-file-access-on-mac-os-x-in-java
> >>>
> >>> Which file system are you using on your Ubuntu? If it is ext3, then
> >>> changing it to ext4 may address the issue.
> >>>
> >>> Thanks,
> >>> Koji
> >>>
> >>> On Thu, Jun 1, 2017 at 1:25 AM, Roman <ramon9869@> wrote:
> >>>> Hi there, I need help.
> >>>>
> >>>> We are preparing a high-load project and tested these processors. We
> >>>> always see the listing.timestamp and processed.timestamp keys without
> >>>> milliseconds (xxxxxxxxxx000). As a result, if several files are
> >>>> generated within one second, not all of them are listed.
> >>>>
> >>>>
> >>>> Test:
> >>>> 1. Start the ListFile/ListSFTP processor
> >>>> 2. Generate 10000 zero-size files. My command:
> >>>>    for i in {1..10000}; do touch ./test_$i; done
> >>>> 3. See the processor stats: out 3952 (0 bytes)
> >>>>
> >>>>
> >>>> Am I doing something wrong? Or is it a bug in NiFi/Java/etc.?
> >>>>
> >>>> Environment
> >>>>
> >>>> Ubuntu 14.04.5 LTS, x64, ext4 file system
> >>>> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> >>>> Nifi 1.2.0 From 3a605af, Tagged nifi-1.2.0-RC2
> >>>>
> >>>>
> >>>> Thanks
> >>>> Roman
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> View this message in context:
> >>>> http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037.html
> >>>> Sent from the Apache NiFi Developer List mailing list archive at
> >>>> Nabble.com.
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037p16118.html
> >> Sent from the Apache NiFi Developer List mailing list archive at
> >> Nabble.com.
>
