[
https://issues.apache.org/jira/browse/NIFI-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725535#comment-17725535
]
ASF subversion and git services commented on NIFI-11557:
--------------------------------------------------------
Commit a12c9ca9c72e8004afaf2f91088141ffd67ac437 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=a12c9ca9c7 ]
NIFI-11557: Avoid using the expensive and unnecessary Files.walkFileTree on
startup and initialization of Content Repository. Also performed some code
cleanup: IntelliJ flagged many warnings in the class, mostly around methods
that are no longer used and potential NullPointerExceptions, so those were
cleaned up. Additionally, removed the nifi property for max flowfiles per claim
- this property was never implemented. It was referenced, but the way in which
it was used curiously had nothing to do with what the property was intended to
be used for or for how it was documented. Instead, it was used to limit the max
number of claims that could remain writable. As a result, it was removed.
NIFI-11557: Added an additional system test and updated github actions to
include surefire-report in order to help diagnose problem that occurred in one
of the last system-test runs in GitHub. Could not replicate the problem locally.
Signed-off-by: Matthew Burgess <[email protected]>
This closes #7265
> Eliminate use of Files.walkFileTree for any performance-critical parts of
> application
> -------------------------------------------------------------------------------------
>
> Key: NIFI-11557
> URL: https://issues.apache.org/jira/browse/NIFI-11557
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework, Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Labels: content-repo, content-repository, performance, slowness,
> startup
> Fix For: 1.latest, 2.latest
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The FileSystemRepository (content repo implementation) as well as ListFile
> both make use of the {{Files.walkFileTree}} method. Recently, I worked with a
> user who had horribly long startup times. Thread dumps show that the time was
> almost entirely in the FileSystemRepository's {{initializeRepository}} method
> as it is walking the file tree in order to determine which archive files can
> be cleaned up next. This is done during startup and again periodically in
> background threads.
> I made a small modification locally to instead use the standard synchronous
> IO methods (the {{File.listFiles}} method). I used GenerateFlowFile to generate
> 1-byte FlowFiles and set {{nifi.content.claim.max.appendable.size=1 B}} in
> nifi.properties in order to generate a huge number of files - about 1.2
> million files in the content repository and restarted a few times.
> Additionally, I added some log lines to show how long this part of the startup
> process took.
> With the existing code, startup took 210 seconds (3.5 mins). With the new
> implementation, it took 6.7 seconds. This appears to be due to the fact that
> when using NIO.2, it does an individual disk access for every file to obtain
> file attributes, while when using the {{File.listFiles}} method the File
> objects that are returned already have the necessary attributes. As a result,
> the NIO.2 approach makes millions of disk accesses that are unnecessary. As
> the number of files in the repository grows, the discrepancy also grows.
> We need to eliminate any use of {{Files.walkFileTree}} for any
> performance-critical parts of the codebase.
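
For illustration, the kind of traversal described above could be sketched roughly as follows. This is a minimal sketch, not the code from the commit: the class name ListFilesWalk, the method countFiles, and the temporary directory layout are all hypothetical, and it only counts files rather than deciding which archive files to clean up. It walks a directory tree iteratively with File.listFiles and logs how long the walk took.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of NiFi: walks a tree with File.listFiles.
final class ListFilesWalk {

    // Counts the non-directory entries found under root.
    static long countFiles(final File root) {
        long count = 0;
        final Deque<File> pending = new ArrayDeque<>();
        pending.push(root);

        while (!pending.isEmpty()) {
            final File dir = pending.pop();
            final File[] children = dir.listFiles();
            if (children == null) {
                // listFiles returns null if dir is not a directory or an I/O error occurs
                continue;
            }
            for (final File child : children) {
                if (child.isDirectory()) {
                    pending.push(child);
                } else {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(final String[] args) throws IOException {
        // Assumed layout for the demo only: a temp directory with one nested file.
        final Path root = Files.createTempDirectory("content-repo-sketch");
        final Path section = Files.createDirectories(root.resolve("section-0"));
        Files.createFile(section.resolve("claim-0"));

        final long start = System.nanoTime();
        final long files = countFiles(root.toFile());
        final long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Found " + files + " files in " + millis + " ms");
    }
}
```

An explicit stack is used instead of recursion so that deep directory trees do not grow the call stack. Timing the same tree with Files.walkFileTree would be the natural comparison, but no figures beyond those reported above are claimed here.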
--
This message was sent by Atlassian Jira
(v8.20.10#820010)