Mark Payne created NIFI-7992:
--------------------------------
Summary: Content Repository can fail to cleanup archive directory
fast enough
Key: NIFI-7992
URL: https://issues.apache.org/jira/browse/NIFI-7992
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Reporter: Mark Payne
Assignee: Mark Payne
For the scenario where a user is generating many small FlowFiles and has the
"nifi.content.claim.max.appendable.size" property set to a small value, we can
encounter a situation where data is constantly archived but not cleaned up
quickly enough. As a result, the Content Repository can run out of space.
The FileSystemRepository has a backpressure mechanism built in to prevent this
from happening, but under the above conditions it can sometimes fail to do so.
The backpressure mechanism works by performing the following steps:
# When a new Content Claim is created, the Content Repository determines which
'container' to use.
# Content Repository checks if the amount of storage space used for the
container is greater than the configured backpressure threshold.
# If so, the thread blocks until a background task completes cleanup of the
archive directories.
However, in Step #2 above, the repository determines the amount of space
currently in use by looking at a cached member variable. That cached member
variable is only updated on the first iteration and whenever the background
task completes.
So, now consider a case where there are millions of files in the content
repository archive. The background task could take a massive amount of time
performing cleanup. Meanwhile, processors are able to write to the repository
without any backpressure being applied because the background task hasn't
updated the cached variable for the amount of space used. This continues until
the content repository fills.
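In simplified form, the check described above looks roughly like the following.
This is only an illustrative sketch of the pattern; the class, field, and
method names are assumptions, not the actual FileSystemRepository members.

{code:java}
// Illustrative sketch only: names are assumptions, not the real FileSystemRepository code.
public class BackpressureCheckSketch {
    private final long backpressureThresholdBytes;

    // Only refreshed on the first iteration and when the background cleanup task
    // completes, so it can remain stale while millions of archived files are deleted.
    private volatile long cachedBytesUsed;

    public BackpressureCheckSketch(final long backpressureThresholdBytes) {
        this.backpressureThresholdBytes = backpressureThresholdBytes;
    }

    public void onCreateClaim() throws InterruptedException {
        // Step 2: compares the *cached* usage, not the current usage on disk.
        if (cachedBytesUsed > backpressureThresholdBytes) {
            waitForArchiveCleanup(); // Step 3: block until the background task finishes
        }
        // Otherwise the claim is created immediately, even if the disk is nearly full.
    }

    private void waitForArchiveCleanup() throws InterruptedException {
        // Placeholder: block until the background cleanup task signals completion.
    }
}
{code}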
There are three important but very simple things that should be changed:
# The background task should be faster in this case. While we cannot improve
the amount of time it takes to destroy the files, we currently create an
ArrayList to hold all of the file info and then use an iterator, calling
remove(). Under the hood, each removal shifts the remaining elements of the
underlying array, so clearing the list becomes quadratic. On my laptop,
performing this procedure on an ArrayList with 1 million elements took
approximately 1 minute. Changing to a LinkedList took 15 milliseconds but used
much more heap. Keeping an ArrayList and then removing all of the elements at
the end (via ArrayList.subList(0, n).clear()) resulted in performance similar
to LinkedList with the memory footprint of an ArrayList; a minimal comparison
is sketched after this list.
# The check to see whether the content repository's usage has crossed the
threshold should not rely entirely on a cache that is populated by a process
that can take a long time. It should periodically calculate the disk usage
itself (perhaps once per minute); see the sketch following this list.
# When backpressure does get applied, it can appear that the system has frozen
up, not performing any work at all. The background task that is clearing space
should periodically log its progress at INFO level so that users understand
that this action is taking place (also shown in the sketch following this
list).
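As an illustration of change #1, here is a minimal, self-contained comparison
(not the repository code itself) showing why replacing per-element
Iterator.remove() with a single ArrayList.subList(0, n).clear() call makes such
a large difference:

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RemovalComparison {
    public static void main(final String[] args) {
        final int count = 1_000_000;

        // Current approach: Iterator.remove() shifts the remaining elements of the
        // backing array on every call, so removing all elements is O(n^2).
        final List<Integer> slow = new ArrayList<>();
        for (int i = 0; i < count; i++) slow.add(i);
        long start = System.nanoTime();
        for (final Iterator<Integer> itr = slow.iterator(); itr.hasNext();) {
            itr.next();
            itr.remove();
        }
        System.out.println("Iterator.remove(): " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Proposed approach: iterate without removing, then drop the processed prefix
        // in a single bulk operation, which is O(n).
        final List<Integer> fast = new ArrayList<>();
        for (int i = 0; i < count; i++) fast.add(i);
        start = System.nanoTime();
        int processed = 0;
        for (final Integer ignored : fast) {
            processed++; // the real task would destroy the archived file here
        }
        fast.subList(0, processed).clear();
        System.out.println("subList(0, n).clear(): " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
{code}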
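And as an illustration of changes #2 and #3, a minimal sketch of a time-based
usage refresh combined with periodic progress logging. The class, fields, and
method names here are assumptions for illustration only, not the actual
FileSystemRepository code.

{code:java}
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ArchiveCleanupSketch {
    private static final Logger logger = LoggerFactory.getLogger(ArchiveCleanupSketch.class);
    private static final long USAGE_REFRESH_MILLIS = TimeUnit.MINUTES.toMillis(1);
    private static final long PROGRESS_LOG_MILLIS = TimeUnit.SECONDS.toMillis(30);

    private volatile long cachedBytesUsed;
    private volatile long lastUsageRefresh;

    // Change #2: recalculate the container's usage if the cached value is more than
    // a minute old, rather than waiting for the slow cleanup task to refresh it.
    public long getContainerUsage(final Path container) throws IOException {
        final long now = System.currentTimeMillis();
        if (now - lastUsageRefresh > USAGE_REFRESH_MILLIS) {
            final FileStore store = Files.getFileStore(container);
            cachedBytesUsed = store.getTotalSpace() - store.getUsableSpace();
            lastUsageRefresh = now;
        }
        return cachedBytesUsed;
    }

    // Change #3: log progress at INFO level periodically so that users can see the
    // cleanup is still making progress while backpressure is being applied.
    public void destroyExpiredArchives(final List<Path> archiveFiles) throws IOException {
        long lastLog = System.currentTimeMillis();
        int destroyed = 0;
        for (final Path file : archiveFiles) {
            Files.deleteIfExists(file);
            destroyed++;
            if (System.currentTimeMillis() - lastLog > PROGRESS_LOG_MILLIS) {
                logger.info("Archive cleanup in progress: removed {} of {} files", destroyed, archiveFiles.size());
                lastLog = System.currentTimeMillis();
            }
        }
    }
}
{code}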
--
This message was sent by Atlassian Jira
(v8.3.4#803005)