[
https://issues.apache.org/jira/browse/NIFI-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978741#comment-14978741
]
Mark Petronic edited comment on NIFI-631 at 10/28/15 4:55 PM:
--------------------------------------------------------------
This is my first Jira post ever, so please excuse me if I mess up on the
'protocol'. Feel free to correct me. :) I am very interested in this feature
for my use case, so I thought I would provide some context to hopefully
help in the final implementation. In my case, I have to pull files from a set
of holding directories on NFS, at present 26 different directories. I wish it
were event driven, but it is not, so I have to poll. These holding directories
hold files for 7 days, then they are deleted. There are about 18,000 files
present across all 26 directories at any given time. New files are added
asynchronously and show up about every 5 minutes, sometimes more
frequently. The adding and deleting of files in the holding directories
is outside of my control at this time. I CANNOT move or delete these files; I
have read-only access.
So, from a 100% deterministic processing perspective, state would suit me best.
If I could, for example, retain state about each file processed for up to 8
days, then I could be sure I would never process a duplicate file. If each
state entry could have a maximum age, that would help cap the size of the
state stored.
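
To make the idea of aged state concrete, here is a minimal Python sketch. It is
only an illustration of the bookkeeping, not a proposal for how NiFi should
store state. The class name AgedFileState, its methods, and the example paths
are all hypothetical; the 8-day retention is the figure from my use case.

```python
import time


class AgedFileState:
    """Remembers processed files and forgets entries older than max_age_seconds.

    Hypothetical helper for illustration only; not a NiFi API.
    """

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._seen = {}  # file path -> time it was recorded as processed

    def mark_processed(self, path, now=None):
        self._seen[path] = time.time() if now is None else now

    def is_processed(self, path):
        return path in self._seen

    def expire(self, now=None):
        """Drop entries older than the retention window; return how many were dropped."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age_seconds
        expired = [p for p, t in self._seen.items() if t < cutoff]
        for p in expired:
            del self._seen[p]
        return len(expired)

    def __len__(self):
        return len(self._seen)


DAY = 24 * 60 * 60
state = AgedFileState(max_age_seconds=8 * DAY)
state.mark_processed("/holding/dir01/a.dat", now=0)        # processed on day 0
state.mark_processed("/holding/dir01/b.dat", now=5 * DAY)  # processed on day 5
dropped = state.expire(now=9 * DAY)                         # on day 9, only day-0 entry is too old
```

Because the source directories only keep files for 7 days, an entry that is 8
days old can no longer match a file on disk, so expiring it should not cause a
duplicate.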
One concern I have with using a last-processed timestamp approach is with race
conditions on the NFS. I believe it is possible that, while scanning for files
that are newer than some last-fetched time, a file could be moved into a
holding directory and the scan would miss it. On the next pass, that file
would not be included because it would not be newer than the new last-fetch
timestamp, and it would therefore be missing forever from my data ingest. Have
you considered such a race condition use case in the design?
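
Here is a small Python simulation of the race I have in mind. It is a sketch
under assumptions: the directory is modeled as a plain dict, the timestamps are
made-up integers, and I assume that a file moved into the directory keeps its
original modification time. The function list_newer_than is hypothetical.

```python
def list_newer_than(directory, watermark):
    """Return the names of files whose mtime is strictly newer than the watermark."""
    return sorted(name for name, mtime in directory.items() if mtime > watermark)


# Hypothetical holding directory: file name -> last-modified time.
holding = {"a.dat": 100, "b.dat": 110}

# Pass 1: list everything, then remember the newest mtime seen as the watermark.
pass1 = list_newer_than(holding, watermark=0)
watermark = max(holding[name] for name in pass1)  # 110

# After the listing, a file with an OLDER mtime is moved into the directory.
# Assumption: the move preserves the file's original modification time.
holding["late.dat"] = 105

# Pass 2: only files newer than the watermark are considered.
pass2 = list_newer_than(holding, watermark)

ingested = set(pass1) | set(pass2)
missed = sorted(set(holding) - ingested)  # files that no pass will ever pick up
```

In this run late.dat is never listed, because its timestamp is already behind
the watermark by the time it becomes visible.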
I would be glad to test drive some early code as I am still in POC phase on my
first Hadoop deployment. Hope I can contribute in that way.
> Create ListFile and FetchFile processors
> ----------------------------------------
>
> Key: NIFI-631
> URL: https://issues.apache.org/jira/browse/NIFI-631
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Mark Payne
> Assignee: Joe Skora
> Attachments:
> 0001-NIFI-631-Initial-implementation-of-FetchFile-process.patch
>
>
> This pair of Processors will provide several benefits over the existing
> GetFile processor:
> 1. Currently, GetFile will continually pull the same files if the "Keep
> Source File" property is set to true. There is no way to pull the file and
> leave it in the directory without continually pulling the same file. We could
> implement state here, but it would either be a huge amount of state to
> remember everything pulled or it would have to always pull the oldest file
> first so that we can maintain just the Last Modified Date of the last file
> pulled plus all files with the same Last Modified Date that have already been
> pulled.
> 2. If pulling from network-attached storage such as NFS, this would allow a
> single processor to run ListFile and then distribute those FlowFiles to the
> cluster so that the cluster can share the work of pulling the data.
> 3. There are use cases when we may want to pull a specific file (for example,
> in conjunction with ProcessHttpRequest/ProcessHttpResponse) rather than just
> pull all files in a directory. GetFile does not support this.
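
Item 1 of the description above mentions keeping just the Last Modified Date of
the last file pulled, plus the files already pulled with that same date. A
minimal Python sketch of that bookkeeping might look like the following. It is
an illustration only, not the processor's implementation; list_new_files, the
dict-based directory, and the integer timestamps are all assumptions.

```python
def list_new_files(files, last_mtime, seen_at_last_mtime):
    """Pick files not yet pulled, oldest first.

    files: dict of file name -> last-modified time (hypothetical directory model).
    last_mtime: newest last-modified time pulled so far.
    seen_at_last_mtime: names already pulled that share last_mtime.
    Returns (new_names, updated_last_mtime, updated_seen_set).
    """
    candidates = sorted(
        (mtime, name)
        for name, mtime in files.items()
        if mtime > last_mtime
        or (mtime == last_mtime and name not in seen_at_last_mtime)
    )
    new_names = [name for _, name in candidates]
    if not candidates:
        return new_names, last_mtime, set(seen_at_last_mtime)

    newest = candidates[-1][0]
    # Keep the old set only if the newest timestamp did not advance.
    seen = set(seen_at_last_mtime) if newest == last_mtime else set()
    seen.update(name for mtime, name in candidates if mtime == newest)
    return new_names, newest, seen


files = {"a.dat": 100, "b.dat": 110, "c.dat": 110}
first, last_mtime, seen = list_new_files(files, float("-inf"), set())

files["d.dat"] = 110  # a new file that shares the newest timestamp
second, last_mtime, seen = list_new_files(files, last_mtime, seen)

third, last_mtime, seen = list_new_files(files, last_mtime, seen)  # nothing new
```

The state kept between passes is one timestamp plus the names that share it,
which stays small as long as few files carry the same Last Modified Date.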