Mark Payne created NIFI-10888:
---------------------------------
Summary: Improve performance of Record Readers when inferring
schema of small FlowFiles
Key: NIFI-10888
URL: https://issues.apache.org/jira/browse/NIFI-10888
Project: Apache NiFi
Issue Type: Improvement
Components: Extensions
Reporter: Mark Payne
Assignee: Mark Payne
When we infer the schema of a FlowFile, the Record Reader has to read all of
the data in the FlowFile in order to infer the schema accurately. As a result,
when we use a Record Reader, by default, we must parse the entire FlowFile,
then seek back to the beginning of it, and parse the entire FlowFile again in
order to return the records.
It turns out that for smaller FlowFiles, the most expensive part of this cycle
is actually seeking back to the beginning of the FlowFile (via
{{InputStream.reset()}}). When {{InputStream.reset()}} is called, it closes
the current InputStream and opens a new one, reading from the Content
Repository again, causing a disk seek.
Instead, if {{InputStream.mark()}} is called, we should use a
BufferedInputStream under the hood, and if {{reset()}} is then called, we
should call {{BufferedInputStream.reset()}} if the number of bytes consumed
since mark is less than or equal to the read limit. We should then use
{{InputStream.mark(1024 * 1024)}}.
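A rough sketch of the mark/reset behavior proposed above (not the actual patch). The names BufferedResetStream and StreamSource are hypothetical; StreamSource only stands in for however the content is re-opened from the Content Repository. The sketch assumes that the fallback for an exceeded read limit is to re-open the content and skip forward to the marked position.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Sketch only: serves reset() from memory when possible. Names are hypothetical, not NiFi's API. */
class BufferedResetStream extends InputStream {

    /** Hypothetical stand-in for opening the FlowFile content from the Content Repository. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    private final StreamSource source;
    private InputStream current;   // raw stream until mark() wraps it in a BufferedInputStream
    private boolean marked;
    private long position;         // bytes consumed since the start of the content
    private long markPosition;     // value of 'position' when mark() was last called
    private long bytesSinceMark;   // bytes consumed since the last mark() or reset()
    private int readLimit;

    BufferedResetStream(StreamSource source) throws IOException {
        this.source = source;
        this.current = source.open();
    }

    @Override
    public boolean markSupported() { return true; }

    @Override
    public void mark(int readLimit) {
        if (!(current instanceof BufferedInputStream)) {
            current = new BufferedInputStream(current);
        }
        this.readLimit = readLimit;
        this.markPosition = position;
        this.bytesSinceMark = 0;
        this.marked = true;
        current.mark(bufferLimit());
    }

    @Override
    public void reset() throws IOException {
        if (!marked) throw new IOException("mark() has not been called");
        if (bytesSinceMark <= readLimit) {
            current.reset();                   // cheap path: rewind inside the in-memory buffer
        } else {
            current.close();                   // expensive path: re-open and skip to the mark
            InputStream reopened = source.open();
            skipFully(reopened, markPosition);
            current = new BufferedInputStream(reopened);
            current.mark(bufferLimit());       // keep the mark usable for a later reset()
        }
        position = markPosition;
        bytesSinceMark = 0;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b >= 0) consumed(1);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = current.read(buf, off, len);
        if (n > 0) consumed(n);
        return n;
    }

    @Override
    public void close() throws IOException { current.close(); }

    private void consumed(long n) { position += n; bytesSinceMark += n; }

    // BufferedInputStream can drop its mark when a read is attempted after exactly readLimit
    // bytes (for example the final read that returns -1), so ask it for one spare byte.
    private int bufferLimit() { return (int) Math.min((long) readLimit + 1, Integer.MAX_VALUE); }

    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {                // skip() may return 0, so fall back to read()
                if (in.read() < 0) throw new EOFException("content is shorter than the mark position");
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```

With this shape, a reader would call {{mark(1024 * 1024)}} before inferring the schema and {{reset()}} afterwards. Content that fits within the read limit is replayed from the buffer, and only larger content triggers a second open.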
Effectively, we should buffer up to 1 MB of content when inferring a
schema, which lets us avoid that extra disk seek. For FlowFiles larger than
1 MB, this will not make a difference in performance. However, the seek is
less of a concern for larger FlowFiles, simply because we perform it less
frequently (e.g., 10 FlowFiles of 50 MB each require 100x fewer seeks than
1000 FlowFiles of 5 KB each).