[
https://issues.apache.org/jira/browse/NIFI-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Kawamura reassigned NIFI-4205:
-----------------------------------
Assignee: Koji Kawamura
> TailFile can produce duplicated data when it wrongly assumes a file is rotated
> ------------------------------------------------------------------------------
>
> Key: NIFI-4205
> URL: https://issues.apache.org/jira/browse/NIFI-4205
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.1.0
> Reporter: Koji Kawamura
> Assignee: Koji Kawamura
>
> TailFile checks whether a file being tailed is rotated by following lines of
> code, and if it determines so, it resets the reader and state for the file:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
> {code}
> // Check if file has rotated
> if (rolloverOccurred
> || (timestamp <= file.lastModified() && length >
> file.length())
> || (timestamp < file.lastModified() && length >=
> file.length())) {
> // Since file has rotated, we close the reader, create a new one,
> and then reset our state.
> try {
> reader.close();
> getLogger().debug("Closed FileChannel {}", new
> Object[]{reader, reader});
> } catch (final IOException ioe) {
> getLogger().warn("Failed to close reader for {} due to {}",
> new Object[]{file, ioe});
> }
> reader = createReader(file, 0L);
> position = 0L;
> checksum.reset();
> }
> {code}
> The third condition, newer timestamp but the same file size can work
> negatively in some situations. For example:
> # If an already fully tailed file is 'touched' and last modified timestamp is
> updated. This is the easiest way to produce duplicated content.
> # On Windows, if a file is being tailed and updated by an app that writes
> logs or some data to it consistently and frequently, then the last modified
> timestamp can be delayed to be updated compared to file size. I couldn't find
> canonical docs for this behavior, but testing on Windows consistently
> produces duplicated data. And the 3rd condition becomes true when such
> duplication occurs.
> TailFile updates the file timestamp and length when it reads some data from
> the file, specifically at these lines:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
> {code}
> timestamp = Math.max(state.getTimestamp(), file.lastModified());
> length = file.length();
> {code}
> As mentioned at the 2nd case above, file.lastModified can return a stale
> timestamp (or maybe just not being flushed yet) while length is replaced by
> the latest value. After this happens, at the next onTrigger cycle, the 3rd
> condition becomes true because it detects a newer timestamp.
> These conditions are added by NIFI-1170 and NIFI-1959.
> A simple flow, TailFile -> SplitText -> (FlowFiles are queued) ->
> UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate
> frequently updated log file:
> {code}
> $ for i in `seq 1 10000`; do echo $i >> test.log; done
> {code}
> The expected result is having 10000 generated FlowFiles queued at the
> relationship between SplitText and UpdateAttribute. But on Windows, more
> FlowFiles are generated.
> By enabling debug level log for TailFile, following log messages can be
> confirmed:
> {code}
> Add this to conf/logback.xml
> <logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3]
> o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
> 15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of
> files rolled off = 0
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3]
> o.a.nifi.processors.standard.TailFile
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel
> sun.nio.ch.FileChannelImpl@6d2a1eaf
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3]
> o.a.nifi.processors.standard.TailFile
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel
> sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
> 2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3]
> o.a.nifi.processors.standard.TailFile
> TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at
> position 0
> {code}
> The 3rd condition should be removed to avoid having these duplicated data
> ingested. Or if there's any specific need, we should discuss about it and
> implement additional solution.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)