[ 
https://issues.apache.org/jira/browse/NIFI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-7646:
-----------------------------
    Fix Version/s: 1.14.0
           Status: Patch Available  (was: Open)

> Improve performance of MergeContent / others that read content of many small 
> FlowFiles
> --------------------------------------------------------------------------------------
>
>                 Key: NIFI-7646
>                 URL: https://issues.apache.org/jira/browse/NIFI-7646
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.14.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When MergeContent merges together 1,000 FlowFiles, it must read the content 
> of each of those FlowFiles. This is done by calling 
> `ProcessSession.read(flowFile);`
> Right now, the Process Session ends up calling 
> `ContentRepository.read(ContentClaim)` using the Content Claim from the given 
> FlowFile. As a result, the Content Repository creates a new FileInputStream 
> (1+ disk accesses). It then seeks to the appropriate location on disk (1 disk 
> access). The stream is then wrapped in a LimitingInputStream to prevent the 
> reader from going beyond the boundaries of the associated Content Claim. So 
> if the FlowFile is small, say 200 bytes, the result is that we perform 2+ 
> disk accesses to read those 200 bytes (even though 4K - 8K is a typical block 
> size and could be read in the same amount of time as those 200 bytes).
> Merging 1,000 FlowFiles can therefore require 2,000 or more disk accesses, 
> which causes a huge degradation in performance.
> At the same time, the ProcessSession already has a notion of a 
> currentReadClaimStream and a currentReadClaim. We could get huge performance 
> improvements by making a few small changes in the Content Repository and the 
> Process Session:
> - In ContentRepository, introduce a new method: `InputStream 
> read(ResourceClaim resourceClaim) throws IOException`. This will allow the 
> Process Session to read the entire contents of the underlying Resource Claim 
> if necessary. This is safe because the raw stream is never handed to "user 
> code"; the Process Session will enforce the Content Claim bounds itself.
> - ProcessSession should use this new method to access the stream for an 
> entire ResourceClaim. It should then skip to the appropriate location, as 
> that will not have been done by the Content Repository. Then, the InputStream 
> should be wrapped in a BufferedInputStream. This will help for cases when a 
> LimitingInputStream restricts reads to only 200 bytes - in this case, the 
> disk access will still pull back 4-8K and that will live in the 
> BufferedInputStream.
> - ProcessSession should change the currentReadClaim from a Content Claim to a 
> Resource Claim to allow for this to work. Additionally, the getInputStream() 
> method should relax the constraint "writeRecursionSet.isEmpty()" for reusing 
> the stream and instead use "!writeRecursionSet.contains(flowFile)", as this 
> will be important for MergeContent, since it will be writing to one FlowFile 
> while reading from another.
> These changes will transparently (to the processors) provide a very 
> significant performance gain in cases where a Processor must read the content 
> of many small FlowFiles, if the FlowFiles all have the same Resource Claim 
> (which is the case more often than not).
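
The difference between the two read patterns described above can be sketched in plain Java. This is only an illustration under stated assumptions, not NiFi code: the class and method names (SharedClaimReadSketch, readNaive, readShared, CountingInputStream) are hypothetical, an in-memory byte array stands in for the Resource Claim's file, an access counter stands in for disk accesses, and a bounded read stands in for the LimitingInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class SharedClaimReadSketch {

    /** Counts calls that reach the underlying source; a stand-in for disk accesses. */
    static final class CountingInputStream extends FilterInputStream {
        int accesses = 0;

        CountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException { accesses++; return super.read(); }

        @Override public int read(byte[] b, int off, int len) throws IOException {
            accesses++;
            return super.read(b, off, len);
        }

        @Override public long skip(long n) throws IOException { accesses++; return super.skip(n); }
    }

    /** Skips exactly n bytes; InputStream.skip may skip fewer than requested. */
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("offset is past end of stream");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Reads exactly len bytes, never more (the role a limiting stream plays). */
    static byte[] readSlice(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        if (in.readNBytes(buf, 0, len) < len) throw new EOFException("slice is truncated");
        return buf;
    }

    /** Current pattern: open a stream per slice, seek to its offset, read it. */
    static int readNaive(byte[] resource, int sliceLen, byte[][] out) {
        int accesses = 0;
        try {
            for (int i = 0; i < out.length; i++) {
                CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(resource));
                accesses++; // count the open itself
                skipFully(in, (long) i * sliceLen);
                out[i] = readSlice(in, sliceLen);
                accesses += in.accesses;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return accesses;
    }

    /** Proposed pattern: one buffered stream over the whole resource, reused per slice. */
    static int readShared(byte[] resource, int sliceLen, byte[][] out) {
        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(resource));
        try {
            InputStream buffered = new BufferedInputStream(raw, 8192);
            long position = 0; // how far into the resource the shared stream has advanced
            for (int i = 0; i < out.length; i++) {
                long offset = (long) i * sliceLen;
                skipFully(buffered, offset - position); // no-op when slices are adjacent
                out[i] = readSlice(buffered, sliceLen);
                position = offset + sliceLen;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return 1 + raw.accesses; // one open plus the buffer refills
    }

    public static void main(String[] args) {
        byte[] resource = new byte[1000 * 200]; // 1,000 slices of 200 bytes each
        for (int i = 0; i < resource.length; i++) resource[i] = (byte) i;
        byte[][] naive = new byte[1000][];
        byte[][] shared = new byte[1000][];
        System.out.println("naive accesses:  " + readNaive(resource, 200, naive));
        System.out.println("shared accesses: " + readShared(resource, 200, shared));
    }
}
```

In this sketch the shared stream only moves forward, so it can be reused only while the next slice starts at or after the current position in the same resource; a slice in a different resource, or at an earlier offset, would need a fresh stream.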



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
