[
https://issues.apache.org/jira/browse/NIFI-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269099#comment-17269099
]
Jens M Kofoed commented on NIFI-7263:
-------------------------------------
I have 2 cases.
Case 1:
We have two networks that are isolated from each other by a data diode, so
one of the networks has no contact with the internet. From time to time we
need to download a new driver or software from the internet and transfer it to
the inside network. All transfer flows are handled by NiFi. In this case we
have a drop folder from which NiFi moves all files to a similar folder on the
inside. Here we use the List and Fetch processors instead of GetFile because we
want the benefit of the cluster. Since we manually copy files to the drop
folder, these files keep their original timestamps, so we cannot use the
"Tracking Timestamps" strategy. If we use the "Tracking Entities" strategy, the
"Entity Tracking Time Window" needs to be set to years.
Case 2:
We have a file server where different systems write files to different
subfolders. We use NiFi to synchronise all files by looking in the root folder
with Recurse Subdirectories set to true. We are not allowed to delete files, so
all files stay there permanently. Therefore we can't use a GetFile processor.
With the "Tracking Timestamps" strategy we have had a situation where a file
was not picked up by NiFi. If there are many files when NiFi starts scanning
all files/folders, and a new file is written to the first folder just after
NiFi has looked in that folder, this file will not be in the listing. If
another file is then written to the last folder NiFi scans, it will be in the
listing, and that file will have a newer timestamp. So on the next scan the
first file will not be picked up, because it is older than the last recorded
timestamp.
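As a rough illustration of the race described above, here is a deliberately
simplified Python model. It is not NiFi's actual code: the scan() function,
the folder names and the timestamps are all made up for the example, and the
only rule modelled is "list files newer than the newest timestamp seen so far".

```python
def scan(snapshots, watermark):
    """Simplified timestamp-tracking listing (hypothetical model).

    snapshots: one list of (path, mtime) per folder, holding what the
    folder contained at the moment the scan visited it.
    watermark: newest mtime recorded by the previous scan.
    """
    seen = [entry for folder in snapshots for entry in folder]
    listed = [path for path, mtime in seen if mtime > watermark]
    new_watermark = max([watermark] + [mtime for _, mtime in seen])
    return listed, new_watermark


# Scan 1: the first folder is visited while still empty. first/a.txt is
# written at t=101, after that visit. last/b.txt is written at t=102,
# before the scan reaches the last folder.
listed1, watermark1 = scan([[], [("last/b.txt", 102)]], watermark=0)

# Scan 2: both files are now visible, but a.txt (101) is older than the
# recorded watermark (102), so it is never listed.
listed2, watermark2 = scan(
    [[("first/a.txt", 101)], [("last/b.txt", 102)]], watermark1
)

print(listed1)  # ['last/b.txt']
print(listed2)  # []
```

In this model first/a.txt is never listed by any later scan unless its
timestamp changes.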
Therefore we are using the "Tracking Entities" strategy, which has another
issue: if you are using a filter regex and change it, "Tracking Entities"
starts all over again, listing all files.
Therefore we have made our own flow where we create a hash value from the path,
filename, file size and timestamp, and check whether that hash has been seen
before. We have had situations where some kinds of files needed to be
transferred again. So with our own detection flow we can route all duplicate
hashes to a subflow and create a new route for special situations.
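The idea behind that detection flow can be sketched in Python as below. This
is only an illustration and not the actual NiFi flow: the function names, the
choice of SHA-256 and the in-memory set standing in for the "seen before"
store are all assumptions.

```python
import hashlib


def file_key(path, name, size, mtime):
    """Hash the four attributes that identify one version of a file."""
    raw = "\0".join([path, name, str(size), str(mtime)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def route(entries, seen):
    """Split listed entries into new files and duplicates.

    entries: iterable of (path, name, size, mtime) tuples.
    seen: set of hashes already processed (updated in place).
    """
    new, duplicates = [], []
    for entry in entries:
        key = file_key(*entry)
        if key in seen:
            duplicates.append(entry)  # candidate for the special subflow
        else:
            seen.add(key)
            new.append(entry)
    return new, duplicates


seen = set()
listing = [("/drop/sys1", "a.csv", 120, 1611000000)]
first_new, first_dup = route(listing, seen)    # first listing: file is new
second_new, second_dup = route(listing, seen)  # same listing: duplicate
# A rewritten file (new size and timestamp) hashes differently, so it is new.
third_new, third_dup = route([("/drop/sys1", "a.csv", 130, 1611000500)], seen)
```

Because the duplicates come back as their own list, they can either be
dropped or sent down a separate route when files have to be transferred again.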
In both of these cases we really don't need any strategy built into the
ListFile processor. We just need it to list all files, regardless of timestamp.
> Add a No tracking Strategy to ListFile/ListFTP
> ----------------------------------------------
>
> Key: NIFI-7263
> URL: https://issues.apache.org/jira/browse/NIFI-7263
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Jens M Kofoed
> Assignee: Waleed Al Aibani
> Priority: Major
> Labels: ListFile, listftp
> Fix For: 1.13.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> ListFile/ListFTP has 2 Listing Strategies: Tracking Timestamps and
> Tracking Entities.
> It would be very, very nice if the List processors could also have a No
> Tracking (fix it yourself) strategy.
> When running NiFi in a cluster, List/Fetch is the perfect solution instead
> of using GetFile. But we have had many cases where files in the pickup
> folder have old timestamps, so here we have to use Tracking Entities.
> The issue is in cases where you are not allowed to delete files but you have
> to make a change to the file filter. Tracking Entities starts all over
> and lists all files again.
> In other situations we need to resend all data and would like to clear the
> state of Tracking Entities. But you can't.
> So I have to make a small flow for detecting duplicates, which in some cases
> just ignores duplicates and in other cases opens up for sending duplicates.
> But it is a pain in the ... to use Tracking Entities.
> So a NO STRATEGY option would be very, very nice.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)