[
https://issues.apache.org/jira/browse/NIFI-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640194#comment-17640194
]
Mark Payne commented on NIFI-10888:
-----------------------------------
To demonstrate performance, I created a huge number of tiny CSV files (60 bytes
each) and used UpdateRecord to parse them, then write them out as JSON.
Throughput increased by about 2.3x.
> Improve performance of Record Readers when inferring schema of small FlowFiles
> ------------------------------------------------------------------------------
>
> Key: NIFI-10888
> URL: https://issues.apache.org/jira/browse/NIFI-10888
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Labels: performance
> Fix For: 1.20.0
>
> Attachments: InferSchema-AfterChanges.png,
> InferSchema-BeforeChanges.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When we infer the schema of a FlowFile, the Record Reader has to read all of
> the data in the FlowFile in order to infer the schema accurately. As a
> result, when we use a Record Reader, by default, we must parse the entire
> FlowFile, then seek back to the beginning of it, and parse the entire
> FlowFile again in order to return the records.
> It turns out that for smaller FlowFiles, the most expensive part of this
> cycle is actually seeking back to the beginning of the FlowFile (via
> {{InputStream.reset()}}). When {{InputStream.reset()}} is called, it
> closes the current InputStream and opens a new one, reading from the Content
> Repository again, causing a disk seek.
> Instead, if {{InputStream.mark()}} is called, we should use a
> BufferedInputStream under the hood, and if {{reset()}} is then called, we
> should call {{BufferedInputStream.reset()}} if the number of bytes consumed
> since mark is less than or equal to the read limit. We should then use
> {{InputStream.mark(1024 * 1024)}}.
> Effectively, we should buffer up to 1 MB worth of content when inferring a
> schema. As a result, we can avoid that extra disk seek. For FlowFiles larger
> than 1 MB, this will not make a difference in performance. However, for
> larger FlowFiles the seek is less of a concern, simply because we perform
> it less frequently (e.g., with 10 FlowFiles of 50 MB each vs. 1000
> FlowFiles of 5 KB each, we end up seeking 100x less frequently in the case
> of larger FlowFiles).
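
The mark/reset behavior proposed above can be sketched roughly as follows. This is a minimal illustration, not the actual NiFi change: the names StreamSource, BufferedResetInputStream, and reopenCount are hypothetical, and the sketch assumes mark() is called at the very start of the stream, as it is when inferring a schema.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for "open the content from the Content Repository". */
interface StreamSource {
    InputStream open() throws IOException;
}

/**
 * Sketch only: after mark(), reads go through a BufferedInputStream, so
 * reset() is served from memory while the bytes consumed since mark() stay
 * within the read limit. Past the limit, reset() falls back to closing and
 * reopening the source. Assumes mark() is called at the start of the stream.
 */
final class BufferedResetInputStream extends InputStream {
    private final StreamSource source;
    private InputStream current;           // stream currently being read
    private BufferedInputStream buffered;  // non-null once mark() was called
    private int readLimit;
    private long bytesSinceMark;
    int reopenCount = 0;                   // how often we fell back to reopening

    BufferedResetInputStream(StreamSource source) throws IOException {
        this.source = source;
        this.current = source.open();
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b >= 0) {
            bytesSinceMark++;
        }
        return b;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public void mark(int limit) {
        if (buffered == null) {
            // Wrap lazily so callers that never mark() pay no buffering cost.
            buffered = new BufferedInputStream(current);
            current = buffered;
        }
        buffered.mark(limit);
        readLimit = limit;
        bytesSinceMark = 0;
    }

    @Override
    public void reset() throws IOException {
        if (buffered != null && bytesSinceMark <= readLimit) {
            buffered.reset();              // cheap: rewinds the in-memory buffer
        } else {
            current.close();               // expensive: reopen the content
            current = source.open();
            buffered = null;
            reopenCount++;
        }
        bytesSinceMark = 0;
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```

With this sketch, a reader would call mark(1024 * 1024), consume the stream to infer the schema, call reset(), and then read the records. For content up to 1 MB the second pass comes from the buffer and reopenCount stays at 0; for larger content the stream is reopened exactly as before.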