[ https://issues.apache.org/jira/browse/NIFI-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647619#comment-17647619 ]
ASF subversion and git services commented on NIFI-10888:
--------------------------------------------------------
Commit 78be613a0f85b664695ea2cbfaf26163f9b8e454 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=78be613a0f ]
NIFI-10888: When inferring a schema using a Record Reader, buffer up to 1 MB of
FlowFile content for the schema inference so that when we read the contents to
obtain records we can use the buffered data. This helps in cases of small
FlowFiles by not having to seek back to the beginning of the FlowFile every
time.
Signed-off-by: Matthew Burgess <[email protected]>
This closes #6725
> Improve performance of Record Readers when inferring schema of small FlowFiles
> ------------------------------------------------------------------------------
>
> Key: NIFI-10888
> URL: https://issues.apache.org/jira/browse/NIFI-10888
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Labels: performance
> Attachments: InferSchema-AfterChanges.png,
> InferSchema-BeforeChanges.png
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When we infer the schema of a FlowFile, the Record Reader has to read all of
> the data in the FlowFile in order to infer the schema accurately. As a
> result, when we use a Record Reader, by default, we must parse the entire
> FlowFile, then seek back to the beginning of it, and parse the entire
> FlowFile again in order to return the records.
> It turns out that for smaller FlowFiles, the most expensive part of this
> cycle is actually seeking back to the beginning of the FlowFile (via
> {{InputStream.reset()}}). When {{InputStream.reset()}} is called, it
> closes the current InputStream and opens a new one, reading from the Content
> Repository again, causing a disk seek.
> Instead, if {{InputStream.mark()}} is called, we should use a
> BufferedInputStream under the hood, and if {{reset()}} is then called, we
> should call {{BufferedInputStream.reset()}} if the number of bytes consumed
> since mark is less than or equal to the read limit. We should then use
> {{InputStream.mark(1024 * 1024)}}.
> Effectively, we should buffer up to 1 MB worth of content when inferring a
> schema. As a result, we can avoid that extra disk seek. For FlowFiles larger
> than 1 MB, this will not make a difference in performance. However, the seek
> is less of a concern for larger FlowFiles, simply because we perform it less
> frequently (e.g., with 10 FlowFiles of 50 MB each vs. 1000 FlowFiles of 5 KB
> each, we seek 100x less frequently in the case of the larger FlowFiles).
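
The mark/reset behavior described in the issue can be sketched roughly as
follows. This is an illustration only, not the code from the commit: the
class names RewindableContentStream and MarkResetDemo, the Supplier used to
reopen the content, and the reopenCount field are all hypothetical, and the
sketch assumes mark() is called before any bytes are read. It rewinds from
the in-memory buffer when the bytes consumed since mark() fit within the
read limit, and otherwise falls back to closing and reopening the stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical sketch; not NiFi's actual class.
class RewindableContentStream extends InputStream {
    private final Supplier<InputStream> opener; // assumed: opens the content from offset 0
    private InputStream current;
    private int readLimit = -1;      // -1 means mark() has not been called
    private long consumedSinceMark;  // bytes read since mark()
    int reopenCount;                 // illustration only: counts fallbacks to reopening

    RewindableContentStream(Supplier<InputStream> opener) {
        this.opener = opener;
        this.current = new BufferedInputStream(opener.get());
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b != -1) consumedSinceMark++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = current.read(buf, off, len);
        if (n > 0) consumedSinceMark += n;
        return n;
    }

    @Override
    public boolean markSupported() { return true; }

    // The sketch assumes mark() is called at offset 0, before any read.
    @Override
    public void mark(int limit) {
        readLimit = limit;
        consumedSinceMark = 0;
        current.mark(limit);
    }

    @Override
    public void reset() throws IOException {
        boolean rewound = false;
        if (readLimit >= 0 && consumedSinceMark <= readLimit) {
            try {
                current.reset(); // cheap path: rewind within the buffer
                rewound = true;
            } catch (IOException invalidMark) {
                // The buffer may still have dropped the mark; fall through and reopen.
            }
        }
        if (!rewound) {
            current.close(); // expensive path: reopen from the start
            current = new BufferedInputStream(opener.get());
            if (readLimit >= 0) current.mark(readLimit);
            reopenCount++;
        }
        consumedSinceMark = 0;
    }

    @Override
    public void close() throws IOException { current.close(); }
}

class MarkResetDemo {
    // Drains the stream; stands in for both the inference pass and the record pass.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) out.write(buf, 0, n);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "id,name\n1,a\n2,b\n".getBytes(StandardCharsets.UTF_8);
        try (RewindableContentStream in =
                 new RewindableContentStream(() -> new ByteArrayInputStream(content))) {
            in.mark(1024 * 1024);            // buffer up to 1 MB, as in the issue
            byte[] first = readAll(in);      // pass 1: "infer the schema"
            in.reset();                      // small content: no reopen expected
            byte[] second = readAll(in);     // pass 2: "read the records"
            System.out.println(Arrays.equals(first, second) + " reopens=" + in.reopenCount);
        }
    }
}
```

In this sketch the fallback path re-establishes the mark on the reopened
stream, so a later reset() on a large FlowFile still works; it just pays for
the reopen each time, which matches the trade-off described above.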
--
This message was sent by Atlassian Jira
(v8.20.10#820010)