[
https://issues.apache.org/jira/browse/NIFI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt Burgess updated NIFI-7646:
-------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Improve performance of MergeContent / others that read content of many small
> FlowFiles
> --------------------------------------------------------------------------------------
>
> Key: NIFI-7646
> URL: https://issues.apache.org/jira/browse/NIFI-7646
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.14.0, 1.13.1
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When MergeContent merges together 1,000 FlowFiles, it must read the content
> of each of those FlowFiles. This is done by calling
> `ProcessSession.read(flowFile);`
> Right now, the Process Session ends up calling
> `ContentRepository.read(ContentClaim)` using the Content Claim from the given
> FlowFile. As a result, the Content Repository creates a new FileInputStream
> (1+ disk accesses). It then seeks to the appropriate location on disk (1 disk
> access). The stream is then wrapped in a LimitingInputStream to prevent the
> reader from going beyond the boundaries of the associated Content Claim. So
> if the FlowFile is small, say 200 bytes, the result is that we perform 2+
> disk accesses to read those 200 bytes (even though 4K - 8K is a typical block
> size and could be read in the same amount of time as those 200 bytes).
> As a result, merging 1,000 FlowFiles can result in many disk accesses and a
> huge degradation in performance.
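
To make the cost described above concrete, here is a minimal sketch of the per-FlowFile read path, not the actual NiFi code. The class names `BoundedStream` and `PerClaimReadSketch`, the `filesOpened` counter, and the idea of addressing a claim by (offset, length) within one file are assumptions made for illustration; `BoundedStream` only stands in for the role the description gives to LimitingInputStream.

```java
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

/** Hypothetical stand-in for a limiting stream: exposes at most `limit` bytes. */
class BoundedStream extends FilterInputStream {
    private long remaining;

    BoundedStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}

/** Sketch only: every claim read opens the file again and skips to its offset. */
class PerClaimReadSketch {
    static int filesOpened = 0;

    static InputStream openClaim(Path resourceFile, long offset, long length) throws IOException {
        InputStream in = new FileInputStream(resourceFile.toFile()); // new stream per claim
        filesOpened++;
        long toSkip = offset;
        while (toSkip > 0) {                                         // move to the claim's start
            long skipped = in.skip(toSkip);
            if (skipped <= 0) {
                in.close();
                throw new IOException("Could not reach offset " + offset);
            }
            toSkip -= skipped;
        }
        return new BoundedStream(in, length);                        // keep the reader in bounds
    }

    /** Reads claimCount consecutive claims of claimSize bytes; returns total bytes read. */
    static long readEveryClaim(Path resourceFile, int claimCount, int claimSize) throws IOException {
        long total = 0;
        for (int i = 0; i < claimCount; i++) {
            try (InputStream in = openClaim(resourceFile, (long) i * claimSize, claimSize)) {
                total += in.readAllBytes().length;
            }
        }
        return total;
    }
}
```

With 1,000 claims of 200 bytes each, `readEveryClaim` opens the file 1,000 times, which is the pattern the description identifies as the bottleneck.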
> At the same time, ProcessSession already has a notion of the currentReadClaimStream
> and a currentReadClaim. We could get huge performance improvements by making
> a couple of small changes in Content Repo & Process Session:
> - In ContentRepository, introduce a new method: `InputStream
> read(ResourceClaim resourceClaim) throws IOException`. This will allow the
> Process Session to read the entire contents of the underlying Resource Claim
> if necessary. This is safe since it doesn't provide raw access to any "user
> code". The Process Session will protect the bounds properly.
> - ProcessSession should use this new method to access the stream for an
> entire ResourceClaim. It should then skip to the appropriate location, as
> that will not have been done by the Content Repository. Then, the InputStream
> should be wrapped in a BufferedInputStream. This will help for cases when a
> LimitingInputStream restricts reads to only 200 bytes - in this case, the
> disk access will still pull back 4-8K and that will live in the
> BufferedInputStream.
> - ProcessSession should change the currentReadClaim from a Content Claim to a
> Resource Claim to allow for this to work. Additionally, the getInputStream()
> method should relax the constraint "writeRecursionSet.isEmpty()" for reusing
> the stream and instead use "!writeRecursionSet.contains(flowFile)", as this
> will be important for MergeContent, since it will be writing to one FlowFile
> while reading from another.
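
The three changes listed above could look roughly like the following. This is a sketch under stated assumptions, not the patch itself: `SharedClaimReaderSketch`, `readClaim`, `canReuseStream`, and the `filesOpened` counter are hypothetical names, FlowFiles are represented by plain string ids, and the method returns a byte array where a real implementation would presumably hand out a bounded stream. The 8192-byte buffer size is also just an illustrative choice.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Set;

/** Sketch only: one buffered stream over the whole resource claim, reused across claims. */
class SharedClaimReaderSketch implements AutoCloseable {
    private final Path resourceFile;
    private InputStream shared;   // buffered stream over the entire resource claim
    private long position;        // bytes consumed from `shared` so far
    int filesOpened = 0;

    SharedClaimReaderSketch(Path resourceFile) {
        this.resourceFile = resourceFile;
    }

    /** Relaxed reuse rule: only the FlowFile currently being written blocks reuse. */
    static boolean canReuseStream(Set<String> flowFilesBeingWritten, String flowFileToRead) {
        return !flowFilesBeingWritten.contains(flowFileToRead);
    }

    /** Reads `length` bytes starting at `offset` within the resource claim. */
    byte[] readClaim(long offset, int length) throws IOException {
        if (shared == null || offset < position) {
            // A plain stream cannot move backwards, so reopen in that case.
            close();
            shared = new BufferedInputStream(new FileInputStream(resourceFile.toFile()), 8192);
            filesOpened++;
            position = 0;
        }
        while (position < offset) {
            // Skip forward; usually this is served from the buffer, not from disk.
            long skipped = shared.skip(offset - position);
            if (skipped <= 0) {
                throw new IOException("Could not skip to offset " + offset);
            }
            position += skipped;
        }
        byte[] out = new byte[length];
        int read = shared.readNBytes(out, 0, length); // never reads past the claim's bounds
        position += read;
        if (read < length) {
            throw new EOFException("Claim extends past end of resource claim");
        }
        return out;
    }

    @Override
    public void close() throws IOException {
        if (shared != null) {
            shared.close();
            shared = null;
        }
    }
}
```

Reading 1,000 consecutive 200-byte claims through one `SharedClaimReaderSketch` opens the file once, and most reads are served from the buffer. A read at an earlier offset forces a reopen in this sketch.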
> These changes will transparently (to the processors) provide a very
> significant performance gain in cases where a Processor must read the content
> of many small FlowFiles, if the FlowFiles all have the same Resource Claim
> (which is the case more often than not).