[
https://issues.apache.org/jira/browse/NIFI-14095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907920#comment-17907920
]
Michael W Moser commented on NIFI-14095:
----------------------------------------
I also think a 1 minute default Run Schedule does not make sense for the
GetFile processor. It would limit GetFile to ingesting at most Batch Size (10)
files per minute (600 files per hour), which could give users the false
impression that the processor performs poorly.
GetFile will perform a fresh directory listing every "Polling Interval". To
try to avoid trouble with a default configuration, and to more closely match
how ListFile works, I suggest we set the GetFile "Polling Interval" default to
something that is not "0 sec". Perhaps a default Polling Interval of "10 sec"
or "1 min" would make sense?
So, with Run Schedule = 0 sec, GetFile can get a directory listing and wake up
often to process files in that listing as fast as possible. But a Polling
Interval of "10 sec" will cause it to refresh that directory listing less often.
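To illustrate the interaction I have in mind, here is a minimal sketch. It is
not GetFile's actual code: the class name, fields, and the injected clock are
all hypothetical, and it only models "refresh the listing when the Polling
Interval has elapsed, otherwise hand out up to Batch Size entries from the
cached listing".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch, NOT GetFile's real implementation; all names are assumptions.
class PollingListingSketch {
    private final Supplier<List<String>> lister;  // stands in for the directory listing
    private final LongSupplier clockMillis;       // injectable clock, so no real waiting is needed
    private final long pollingIntervalMillis;     // plays the role of "Polling Interval"
    private final int batchSize;                  // plays the role of "Batch Size"
    private final Queue<String> queue = new ArrayDeque<>();
    private long lastListingMillis;
    private boolean listedOnce;
    int listingCount;                             // number of listings performed so far

    PollingListingSketch(Supplier<List<String>> lister, LongSupplier clockMillis,
                         long pollingIntervalMillis, int batchSize) {
        this.lister = lister;
        this.clockMillis = clockMillis;
        this.pollingIntervalMillis = pollingIntervalMillis;
        this.batchSize = batchSize;
    }

    // One processor run; with Run Schedule = 0 sec these happen back to back.
    List<String> onTrigger() {
        long now = clockMillis.getAsLong();
        // Refresh the cached listing only when the polling interval has elapsed.
        if (!listedOnce || now - lastListingMillis >= pollingIntervalMillis) {
            queue.clear();
            queue.addAll(lister.get());
            lastListingMillis = now;
            listedOnce = true;
            listingCount++;
        }
        // Hand out at most batchSize entries from the cached listing.
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.poll());
        }
        return batch;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            files.add("file-" + i);
        }
        // 10 sec polling interval, batch size 10; source files are "kept", so the
        // lister returns the same 25 names every time it is asked.
        PollingListingSketch sketch =
                new PollingListingSketch(() -> files, () -> now[0], 10_000L, 10);
        for (int run = 0; run < 4; run++) {
            System.out.println("t=" + now[0] + " ms, batch=" + sketch.onTrigger().size());
            now[0] += 1;
        }
        now[0] = 10_000L;  // polling interval elapsed: the listing is refreshed
        System.out.println("t=" + now[0] + " ms, batch=" + sketch.onTrigger().size()
                + ", listings=" + sketch.listingCount);
    }
}
```

In this sketch, back-to-back runs drain the cached listing quickly, but the
same kept files are only queued again once per polling interval, which is the
behavior I am suggesting the default should encourage.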
Does this make sense [~marfil] ?
> GetFile - "KeepSourceFile" set to true can fill up content repository
> ---------------------------------------------------------------------
>
> Key: NIFI-14095
> URL: https://issues.apache.org/jira/browse/NIFI-14095
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Configuration
> Affects Versions: 2.0.0, 1.28.1
> Reporter: Filip Maretić
> Priority: Major
> Labels: GetFile, ListFile
>
> Just setting the *KeepSourceFile* property to *true* can cause continuous
> ingestion of files into NiFi. If the file is big (e.g. 20 GB) this can cause
> the content repository (e.g. size of 400 GB) to be filled in an instant. This
> renders the NiFi node unusable and a cleanup is needed. There is no reason
> for this to happen; the flow should at least have enough time to process a
> chunk of such a huge file before attempting to load the same file again.
> A quick solution would be just to add
> {code:java}
> @DefaultSchedule(strategy = SchedulingStrategy.TIMER_DRIVEN, period = "1 min")
> {code}
> This is already present on the ListFile processor, so why not add it here
> as well? If users really want to set this to 0 seconds, I guess they should
> be aware of the consequences.