Mark Payne created NIFI-10888:
---------------------------------

             Summary: Improve performance of Record Readers when inferring 
schema of small FlowFiles
                 Key: NIFI-10888
                 URL: https://issues.apache.org/jira/browse/NIFI-10888
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
            Reporter: Mark Payne
            Assignee: Mark Payne


When we infer the schema of a FlowFile, the Record Reader has to read all of 
the data in the FlowFile in order to infer the schema accurately. As a result, 
when we use a Record Reader, by default, we must parse the entire FlowFile, 
then seek back to the beginning of it, and parse the entire FlowFile again in 
order to return the records.

It turns out that for smaller FlowFiles, the most expensive part of this cycle 
is actually seeking back to the beginning of the FlowFile (via 
{{InputStream.reset()}}). When {{InputStream.reset()}} is called, it closes 
the current InputStream and opens a new one, reading from the Content 
Repository again, causing a disk seek.

Instead, if {{InputStream.mark()}} is called, we should use a 
BufferedInputStream under the hood, and if {{reset()}} is then called, we 
should call {{BufferedInputStream.reset()}} if the number of bytes consumed 
since mark is less than or equal to the read limit. We should then use 
{{InputStream.mark(1024 * 1024)}}.
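
As a rough illustration of the idea (a minimal sketch, not the actual NiFi 
implementation), the wrapper below buffers after {{mark()}} and only re-opens 
the source when more than the read limit has been consumed. The class name 
MarkBufferingInputStream, the StreamOpener interface, and the reopen counter 
are all hypothetical names introduced for this example; StreamOpener stands 
in for whatever re-reads the content from the Content Repository.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative sketch only; the names here are assumptions, not NiFi classes. */
class MarkBufferingInputStream extends InputStream {

    /** Hypothetical hook standing in for re-opening content from the Content Repository. */
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    private final StreamOpener opener;
    private InputStream in;
    private long position = 0;       // bytes consumed since the start of the content
    private long markPosition = -1;  // position at the last mark(), or -1 if never marked
    private int readLimit = 0;       // read limit passed to the last mark()
    private int reopenCount = 0;     // how often reset() had to re-open the source

    MarkBufferingInputStream(StreamOpener opener) throws IOException {
        this.opener = opener;
        this.in = opener.open();
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            position += n;
        }
        return n;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // Start buffering only once a caller actually asks for a mark.
        if (!(in instanceof BufferedInputStream)) {
            in = new BufferedInputStream(in);
        }
        this.readLimit = readLimit;
        this.markPosition = position;
        markBuffer();
    }

    @Override
    public synchronized void reset() throws IOException {
        if (markPosition < 0) {
            throw new IOException("mark() has not been called");
        }
        if (position - markPosition <= readLimit) {
            in.reset(); // cheap path: rewind inside the in-memory buffer
        } else {
            // Expensive path: close, re-open the source, and skip back to the mark.
            in.close();
            in = new BufferedInputStream(opener.open());
            reopenCount++;
            long remaining = markPosition;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Could not skip back to the marked position");
                }
                remaining -= skipped;
            }
            markBuffer();
        }
        position = markPosition;
    }

    private void markBuffer() {
        // One spare byte, so that probing for end-of-stream after exactly
        // readLimit bytes cannot invalidate the buffered mark.
        in.mark(readLimit == Integer.MAX_VALUE ? readLimit : readLimit + 1);
    }

    int getReopenCount() {
        return reopenCount;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With a wrapper like this, a Record Reader would call {{mark(1024 * 1024)}}, 
read the content to infer the schema, and then call {{reset()}}. For content 
of 1 MB or less, the second pass is served from memory; for anything larger, 
the behavior falls back to re-opening the stream as it does today.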

Effectively, we should buffer up to 1 MB worth of content when inferring a 
schema. As a result, we can avoid that extra disk seek. For FlowFiles larger 
than 1 MB, this will not make a difference in performance. However, for larger 
FlowFiles the seek is less of a concern, simply because we perform it less 
frequently (e.g., with 10 FlowFiles of 50 MB each vs. 1000 FlowFiles of 5 KB 
each, we end up seeking 100x less frequently in the case of the larger 
FlowFiles).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
