[
https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981169#comment-14981169
]
Joe Skora commented on NIFI-994:
--------------------------------
That seems reasonable in general, and I really am trying to help. :-D
I'm not trying to be argumentative, but I don't want you to put a big effort into
trying to reach 100% if it is impossible. I'd rather have a simpler processor
that makes a best effort, and make sure users know about the potential problems.
Of the many possible scenarios, I picked the following 4. Scenario #2 results
in lost content and cannot be fixed even with checksumming. Scenario #4 is not
distinguishable from #3 without checksumming the whole file, and it could have
additional lost data if there was a log write between #4/T1 and #4/T3.
* Scenario #1 - file grows but no rotation occurs - no data loss
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #2 - rotation truncates file - data written after last processing
but before truncation is lost
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2 (**LOST WRITE,
UNFIXABLE**)
*# T3 - logger truncates file => len=0, timestamp=T3
*# T4 - logger writes 1K to file => len=1K, timestamp=T4
*# T5 - tail processor processes 0-1K, stores checksum(T5) and timestamp=T4
* Scenario #3 - file grows but no rotation occurs
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - logger writes 2K to file => len=4K, timestamp=T2
*# T3 - tail processor processes 2K-4K, stores checksum(T3) and timestamp=T2
* Scenario #4 - rotation occurs but file size exceeds size at last processing
*# T0 - logger writes 2K to file => len=2K, timestamp=T0
*# T1 - tail processor processes 0-2K, stores checksum(T1) and timestamp=T0
*# T2 - (**log write here would be lost**)
*# T3 - logger rotates file => len=0, timestamp=T3
*# T4 - logger writes 4K to file => len=4K, timestamp=T4 (**PARTIALLY LOST
WRITE**) (**LOOKS LIKE #3/T2**)
*# T5 - tail processor processes 2K-4K, stores checksum(T5) and timestamp=T4
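To make the #3 vs. #4 ambiguity concrete, here is a rough Java sketch. It is not taken from the TailFile patch: the class and method names (TailCheck, rotatedByLength, rotatedByChecksum) are made up for illustration, CRC32 is just one possible checksum, and the file contents are simulated with byte arrays. It compares a length-only rotation check with a check that re-checksums the bytes already processed.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical illustration only; none of these names come from NiFi.
public class TailCheck {

    /** Length-only check: rotation is noticed only if the file shrank below the stored position. */
    public static boolean rotatedByLength(long storedPosition, long currentLength) {
        return currentLength < storedPosition;
    }

    /** CRC32 over the first n bytes of the (simulated) file content. */
    public static long checksum(byte[] data, int n) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, n);
        return crc.getValue();
    }

    /** Prefix check: rotation is assumed if the already-processed bytes are gone or have changed. */
    public static boolean rotatedByChecksum(byte[] file, int storedPosition, long storedChecksum) {
        if (file.length < storedPosition) {
            return true;
        }
        return checksum(file, storedPosition) != storedChecksum;
    }

    /** Builds simulated file content: len bytes, all equal to c. */
    public static byte[] fill(int len, char c) {
        byte[] b = new byte[len];
        Arrays.fill(b, (byte) c);
        return b;
    }

    public static void main(String[] args) {
        // T1 in every scenario: 2K processed, position and checksum stored.
        byte[] atT1 = fill(2048, 'a');
        int position = atT1.length;
        long stored = checksum(atT1, position);

        byte[] scenario2 = fill(1024, 'b'); // truncated, then 1K of new data
        byte[] scenario3 = fill(4096, 'a'); // same file, grown to 4K
        byte[] scenario4 = fill(4096, 'b'); // rotated, then 4K of new data

        System.out.println("#2 length check:   " + rotatedByLength(position, scenario2.length));
        System.out.println("#3 length check:   " + rotatedByLength(position, scenario3.length));
        System.out.println("#4 length check:   " + rotatedByLength(position, scenario4.length));
        System.out.println("#3 checksum check: " + rotatedByChecksum(scenario3, position, stored));
        System.out.println("#4 checksum check: " + rotatedByChecksum(scenario4, position, stored));
    }
}
```

With these simulated contents, the length check flags #2 but gives the same answer for #3 and #4, while the prefix checksum tells #3 and #4 apart. Neither check recovers anything written between the last processing pass and the rotation.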
As long as the file can change outside of NiFi's control (and could change
quickly in some cases), I think it is impossible to design a lossless approach
without copying the data, and even that could be impossible depending on volume
and load.
Thoughts?
> Processor to tail files
> -----------------------
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Joseph Percivall
> Assignee: Mark Payne
> Fix For: 0.4.0
>
> Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch,
> 0002-NIFI-994-Ensure-that-processor-is-not-valid-due-to-t.patch
>
>
> It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't have
> an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need
> to be an option to not just tail a file but pick up where the processor left
> off if it is interrupted.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)