[ https://issues.apache.org/jira/browse/NIFI-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Burgess updated NIFI-11636:
--------------------------------
    Fix Version/s: 2.0.0
                   1.22.0
                       (was: 1.latest)
                       (was: 2.latest)
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> ParquetReader buffers up to 2 GB of content into heap unnecessarily
> -------------------------------------------------------------------
>
>                 Key: NIFI-11636
>                 URL: https://issues.apache.org/jira/browse/NIFI-11636
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 2.0.0, 1.22.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Parquet Record Reader uses the NiFiSeekableInputStream. Because Parquet
> requires reading the footer first, this class is intended to use
> {{mark/reset}} so that we can read the footer and then reset back to the
> beginning.
> To achieve this, it calls {{InputStream.mark(Integer.MAX_VALUE)}}, which will
> buffer up to 2 GB onto the heap. However, the underlying InputStream is the
> ContentClaimInputStream. The ContentClaimInputStream has smarts built into it
> to allow resetting without having to buffer content into memory. In
> particular, if you read past the {{limit}} provided and then call {{reset}},
> it will close the InputStream, open a new InputStream from the beginning
> of the FlowFile content, and seek to the desired offset.
> Because of this, we don't need to use {{InputStream.mark(Integer.MAX_VALUE)}}
> and can instead use {{InputStream.mark(8192)}} or some similarly small value.
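
The reset-by-reopening behavior described in the issue can be illustrated with a minimal sketch. This is not NiFi's ContentClaimInputStream: the class name ReopeningInputStream and the Supplier-based source are hypothetical, and for simplicity the sketch reopens the source on every reset() rather than only after the read limit has been exceeded. It only shows why a small mark limit is sufficient when the stream can reopen its content and seek back to the marked offset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

public class ReopenOnResetSketch {

    /** Hypothetical stream that honors reset() by reopening its source, not by buffering. */
    static class ReopeningInputStream extends InputStream {
        private final Supplier<InputStream> source; // assumption: yields a fresh stream at offset 0
        private InputStream current;
        private long position = 0;     // bytes consumed so far
        private long markPosition = 0; // offset to return to on reset()

        ReopeningInputStream(Supplier<InputStream> source) {
            this.source = source;
            this.current = source.get();
        }

        @Override
        public int read() throws IOException {
            int b = current.read();
            if (b >= 0) {
                position++;
            }
            return b;
        }

        @Override
        public boolean markSupported() {
            return true;
        }

        @Override
        public void mark(int readLimit) {
            // Nothing is buffered, so the limit does not affect heap usage here.
            markPosition = position;
        }

        @Override
        public void reset() throws IOException {
            // Close, reopen from the beginning, then skip forward to the marked offset.
            current.close();
            current = source.get();
            long remaining = markPosition;
            while (remaining > 0) {
                long skipped = current.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Could not seek to marked offset " + markPosition);
                }
                remaining -= skipped;
            }
            position = markPosition;
        }

        @Override
        public void close() throws IOException {
            current.close();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }

        try (ReopeningInputStream in = new ReopeningInputStream(() -> new ByteArrayInputStream(data))) {
            for (int i = 0; i < 10; i++) {
                in.read();            // consume bytes 0..9
            }
            in.mark(8);               // small limit, as the issue suggests
            for (int i = 0; i < 50; i++) {
                in.read();            // read well past the limit
            }
            in.reset();               // reopen and seek back to offset 10
            System.out.println(in.read()); // prints 10
        }
    }
}
```

With a wrapper such as BufferedInputStream, by contrast, a large mark limit permits the internal buffer to grow toward that limit, which is the heap cost the issue describes.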