[ 
https://issues.apache.org/jira/browse/NIFI-14095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907920#comment-17907920
 ] 

Michael W Moser commented on NIFI-14095:
----------------------------------------

I also think a 1 minute default Run Schedule does not make sense for the 
GetFile processor.  It would limit GetFile to ingesting at most Batch Size 
(default 10) files per minute, which could give users the false impression 
that it performs poorly.
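
To put a rough number on that, here is a minimal sketch of the arithmetic.  It 
assumes each scheduled run picks up at most Batch Size files; the class and 
method names are hypothetical and are not part of NiFi.

```java
public class GetFileThroughputSketch {

    // Upper bound on files ingested per hour, ASSUMING each scheduled run
    // handles at most batchSize files. Hypothetical helper, not NiFi API.
    static long maxFilesPerHour(int batchSize, long runScheduleSeconds) {
        if (runScheduleSeconds <= 0) {
            throw new IllegalArgumentException("runScheduleSeconds must be positive");
        }
        long runsPerHour = 3600L / runScheduleSeconds;
        return runsPerHour * batchSize;
    }

    public static void main(String[] args) {
        // Batch Size = 10, Run Schedule = 1 min
        System.out.println(maxFilesPerHour(10, 60)); // prints 600
    }
}
```

Under that assumption, a directory holding a few thousand small files would 
take hours to drain with a 1 minute Run Schedule.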

GetFile will perform a fresh directory listing every "Polling Interval".  To 
try to avoid trouble with a default configuration, and to more closely match 
how ListFile works, I suggest we set the GetFile "Polling Interval" default to 
something that is not "0 sec".  Perhaps a default Polling Interval of "10 sec" 
or "1 min" would make sense?

So, with Run Schedule = 0 sec, GetFile can get a directory listing and wake up 
often to process files in that listing as fast as possible.  But a Polling 
Interval of "10 sec" will cause it to refresh that directory listing less often.
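
As a rough illustration of that interaction, here is a simplified simulation.  
It is not the GetFile source: the class, the lister, and the rule of only 
re-listing once the pending queue is empty are all assumptions made for the 
sketch.  The lister always returns the same three files, as it would with Keep 
Source File = true, and the processor is triggered once per simulated second 
(rather than at 0 sec) to keep the loop finite.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Supplier;

public class PollingIntervalSketch {
    private final Supplier<List<String>> lister;   // hypothetical directory lister
    private final long pollingIntervalMillis;
    private final int batchSize;
    private final Queue<String> pending = new ArrayDeque<>();
    private Long lastListingMillis = null;         // null until the first listing
    private int listingCount = 0;

    PollingIntervalSketch(Supplier<List<String>> lister, long pollingIntervalMillis, int batchSize) {
        this.lister = lister;
        this.pollingIntervalMillis = pollingIntervalMillis;
        this.batchSize = batchSize;
    }

    // One scheduled run: refresh the listing only if the polling interval has
    // elapsed (simplification: and the queue is empty), then hand out a batch.
    List<String> onTrigger(long nowMillis) {
        boolean due = lastListingMillis == null
                || nowMillis - lastListingMillis >= pollingIntervalMillis;
        if (pending.isEmpty() && due) {
            pending.addAll(lister.get());
            lastListingMillis = nowMillis;
            listingCount++;
        }
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }

    // Trigger once per simulated second and report how many listings happened.
    static int simulate(long pollingIntervalMillis, int triggers) {
        // The files never disappear, as with Keep Source File = true.
        Supplier<List<String>> sameFiles = () -> List.of("a.dat", "b.dat", "c.dat");
        PollingIntervalSketch sketch = new PollingIntervalSketch(sameFiles, pollingIntervalMillis, 10);
        for (int i = 0; i < triggers; i++) {
            sketch.onTrigger(i * 1000L);
        }
        return sketch.listingCount;
    }

    public static void main(String[] args) {
        System.out.println(simulate(0, 30));      // prints 30
        System.out.println(simulate(10_000, 30)); // prints 3
    }
}
```

In this sketch a Polling Interval of "0 sec" re-lists, and so re-ingests, the 
same files on every run, while "10 sec" does so only three times in thirty 
seconds even though the processor still wakes up on every run.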

Does this make sense [~marfil] ?

> GetFile - "KeepSourceFile" set to true can fill up content repository
> ---------------------------------------------------------------------
>
>                 Key: NIFI-14095
>                 URL: https://issues.apache.org/jira/browse/NIFI-14095
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Configuration
>    Affects Versions: 2.0.0, 1.28.1
>            Reporter: Filip Maretić
>            Priority: Major
>              Labels: GetFile, ListFile
>
> Just setting the *KeepSourceFile* property to *true* can cause continuous 
> ingestion of files into NiFi. If the file is big (e.g. 20 GB) this can cause 
> the content repository (e.g. size of 400 GB) to be filled in an instant. This 
> renders the NiFi node unusable and a cleanup is needed. There is no reason 
> for this to happen; the flow should at least have enough time to process a 
> chunk of such a huge file before attempting to load the same file again.
> A quick solution would be just to add
> {code:java}
> @DefaultSchedule(strategy = SchedulingStrategy.TIMER_DRIVEN, period = "1 min")
> {code}
> This is already present on the ListFile processor, so why not add it here 
> as well? If users really want to set this to 0 seconds, they should be 
> aware of the consequences.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
