[
https://issues.apache.org/jira/browse/NIFI-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229447#comment-17229447
]
Joe Witt commented on NIFI-7992:
--------------------------------
Didn't review the code in detail but did review this writeup and thinking back
about 8 years ago when I think we last talked about this...the writeup/change
makes a lot of sense!
> Content Repository can fail to cleanup archive directory fast enough
> --------------------------------------------------------------------
>
> Key: NIFI-7992
> URL: https://issues.apache.org/jira/browse/NIFI-7992
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Critical
> Fix For: 1.13.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> For the scenario where a user is generating many small FlowFiles and has the
> "nifi.content.claim.max.appendable.size" property set to a small value, we
> can encounter a situation where data is constantly archived but not cleaned
> up quickly enough. As a result, the Content Repository can run out of space.
> The FileSystemRepository has a backpressure mechanism built in to avoid
> allowing this to happen, but under the above conditions, it can sometimes
> fail to prevent this situation. The backpressure mechanism works by
> performing the following steps:
> # When a new Content Claim is created, the Content Repository determines
> which 'container' to use.
> # Content Repository checks if the amount of storage space used for the
> container is greater than the configured backpressure threshold.
> # If so, the thread blocks until a background task completes cleanup of the
> archive directories.
> However, in Step #2 above, the repository determines the amount of space
> currently being used by looking at a cached member variable. That cached
> member variable is only updated on the first iteration and when the
> background task completes.
> So, now consider a case where there are millions of files in the content
> repository archive. The background task could take a massive amount of time
> performing cleanup. Meanwhile, processors are able to write to the repository
> without any backpressure being applied because the background task hasn't
> updated the cached variable for the amount of space used. This continues
> until the content repository fills.
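The stale-cache failure mode described above can be reduced to a minimal sketch. The class and method names below are illustrative only, not the actual FileSystemRepository API:

```java
public class CachedUsageCheck {
    // Refreshed only on the first iteration and when the background
    // cleanup task completes -- never while that task is still running
    private long cachedUsedBytes;
    private final long thresholdBytes;

    public CachedUsageCheck(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Step 2: the backpressure check consults the cache,
    // not the actual disk usage
    public boolean shouldApplyBackpressure() {
        return cachedUsedBytes >= thresholdBytes;
    }

    // Only the background task refreshes the cache; if it is busy
    // deleting millions of archive files, this may not run for hours
    public void onCleanupComplete(long measuredUsedBytes) {
        this.cachedUsedBytes = measuredUsedBytes;
    }

    public static void main(String[] args) {
        CachedUsageCheck check = new CachedUsageCheck(100L);
        check.onCleanupComplete(50L);  // last measurement: under threshold
        // Actual usage may have grown well past 100 bytes by now, but no
        // backpressure is applied because the cache still says 50
        System.out.println(check.shouldApplyBackpressure());  // false
    }
}
```

Writers keep passing the check as long as the cache holds a pre-threshold value, which is exactly why fix #2 below calls for periodically recomputing the usage independently of the cleanup task.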
> There are three important but very simple things that should be changed:
> # The background task should be faster in this case. While we cannot improve
> the amount of time it takes to destroy the files, we do create an ArrayList
> to hold all of the file info and then use an iterator, calling remove().
> Under the hood, this shifts the remaining contents of the underlying array
> for each file that is removed. On my laptop, performing this procedure on an
> ArrayList with 1 million elements took approximately 1 minute. Changing to a
> LinkedList took 15 milliseconds but used much more heap. Keeping an
> ArrayList, then removing all of the elements at the end (via
> ArrayList.subList(0, n).clear()) resulted in similar performance to
> LinkedList with the memory footprint of ArrayList.
> # The check to see whether or not the content repository's usage has crossed
> the threshold should not rely entirely on a cache that is populated by a
> process that can take a long time. It should periodically calculate the disk
> usage itself (perhaps once per minute).
> # When backpressure does get applied, it can appear that the system has
> frozen up, not performing any sort of work. The background task that is
> clearing space should periodically log its progress at INFO level to allow
> users to understand that this action is taking place.
>
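The removal technique from change #1 can be shown in isolation. The list contents here are placeholders, not real archive entries:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BulkRemoval {
    // Slow: each Iterator.remove() on an ArrayList shifts every remaining
    // element left by one, so removing the first n entries is O(n * size)
    static void removeViaIterator(List<String> files, int n) {
        Iterator<String> it = files.iterator();
        for (int i = 0; i < n && it.hasNext(); i++) {
            it.next();
            it.remove();
        }
    }

    // Fast: subList(0, n).clear() removes the whole range with a single
    // shift of the tail, keeping ArrayList's compact memory footprint
    static void removeViaSubList(List<String> files, int n) {
        files.subList(0, n).clear();
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            files.add("claim-" + i);
        }
        removeViaSubList(files, 900_000);
        System.out.println(files.size() + " entries remain");
    }
}
```

Both methods leave the list in the same state; only the cost differs, which is why the fix can swap the removal strategy without changing the cleanup logic around it.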
--
This message was sent by Atlassian Jira
(v8.3.4#803005)