[ 
https://issues.apache.org/jira/browse/NIFI-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Burgess updated NIFI-11636:
--------------------------------
    Fix Version/s: 2.0.0
                   1.22.0
                       (was: 1.latest)
                       (was: 2.latest)
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> ParquetReader buffers up to 2 GB of content into heap unnecessarily
> -------------------------------------------------------------------
>
>                 Key: NIFI-11636
>                 URL: https://issues.apache.org/jira/browse/NIFI-11636
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 2.0.0, 1.22.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Parquet Record Reader uses the NiFiSeekableInputStream. Because Parquet 
> requires reading the footer first, this class is intended to use 
> {{mark/reset}} so that we can read the footer and then reset back to the 
> beginning.
> To achieve this, it calls {{InputStream.mark(Integer.MAX_VALUE)}}, which will 
> buffer up to 2 GB onto heap. However, the underlying InputStream is the 
> ContentClaimInputStream. The ContentClaimInputStream has smarts built into it 
> to allow resetting without having to buffer content into memory. In 
> particular, if you read past the {{limit}} provided and then call {{reset}}, 
> it will close the InputStream, open a new InputStream from the beginning 
> of the FlowFile content, and seek to the desired offset.
> Because of this, we don't need to use {{InputStream.mark(Integer.MAX_VALUE)}} 
> and can instead use {{InputStream.mark(8192)}} or some similarly small value.
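
The reset-by-reopening behavior described above can be illustrated with a minimal sketch. This is not NiFi's ContentClaimInputStream; the class names, the Supplier-based reopen, and the reopenCount field are hypothetical and exist only for illustration. For simplicity the sketch always reopens on reset, whereas the description says the real class does so once the read limit has been exceeded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical stand-in for a content-backed stream; NOT NiFi's ContentClaimInputStream.
class ReopeningInputStream extends InputStream {
    private final Supplier<InputStream> source; // opens the content again from offset 0
    private InputStream delegate;
    private long position = 0;
    private long markedPosition = 0;
    int reopenCount = 0; // exposed only so the demo can observe reopen events

    ReopeningInputStream(Supplier<InputStream> source) {
        this.source = source;
        this.delegate = source.get();
    }

    @Override
    public int read() throws IOException {
        int b = delegate.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // Only the offset is remembered; nothing is buffered, whatever readLimit is.
        markedPosition = position;
    }

    @Override
    public synchronized void reset() throws IOException {
        // Close, reopen from the beginning, then skip forward to the marked offset.
        delegate.close();
        delegate = source.get();
        reopenCount++;
        long remaining = markedPosition;
        while (remaining > 0) {
            long skipped = delegate.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Could not seek to marked offset");
            }
            remaining -= skipped;
        }
        position = markedPosition;
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

class MarkResetSketch {
    // Marks with a small limit, reads far past it, resets, and returns the next byte.
    static int firstByteAfterReset(byte[] content, int readLimit) {
        try (ReopeningInputStream in =
                 new ReopeningInputStream(() -> new ByteArrayInputStream(content))) {
            in.mark(readLimit);
            while (in.read() >= 0) {
                // consume everything, well beyond readLimit
            }
            in.reset();
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = new byte[100_000];
        content[0] = 42;
        System.out.println(firstByteAfterReset(content, 8192)); // prints 42
    }
}
```

With a stream of this kind, the value passed to mark does not limit how far back reset can go, which is why a small value such as 8192 is sufficient in the case described.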



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
