[
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607287#comment-17607287
]
Weston Pace commented on ARROW-17599:
-------------------------------------
Ack. I was not really aware that was how the parquet reader operated. That
comment is very helpful. Hmm, in that case maybe a better fix is to improve
how we scan parquet files. Currently we get an async generator from a parquet
reader for the entire file. The code for it is
[here|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L1162].
This prebuffers the entire range of row groups before we even start reading.
In practice I think we only want to prebuffer a row group right before we're
ready to actually read that row group.
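
To make the difference concrete, here is a minimal toy model of the two scan orders. It is only a sketch: the function names are hypothetical, nothing here is the actual Arrow or parquet reader API, and a row group is reduced to a byte count. It reports how many bytes are sitting in the cache at the moment each row group read begins.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model only: these names are hypothetical and are not the Arrow API.
// Each element of row_group_bytes is the buffered size of one row group.

// Prebuffer every row group up front, then read them one by one.
// Returns the bytes held in the cache when each read starts.
std::vector<int64_t> BufferedAtEachReadEager(
    const std::vector<int64_t>& row_group_bytes) {
  const int64_t total = std::accumulate(row_group_bytes.begin(),
                                        row_group_bytes.end(), int64_t{0});
  // Everything was requested before the first read, so every read sees it all.
  return std::vector<int64_t>(row_group_bytes.size(), total);
}

// Prebuffer a row group only right before that row group is read.
// Returns the bytes held in the cache when each read starts.
std::vector<int64_t> BufferedAtEachReadLazy(
    const std::vector<int64_t>& row_group_bytes) {
  std::vector<int64_t> observed;
  int64_t buffered = 0;
  for (int64_t bytes : row_group_bytes) {
    buffered += bytes;             // prebuffer just this row group
    observed.push_back(buffered);  // the read of this row group starts here
  }
  return observed;
}
```

Note that the lazy order alone only delays the growth. The model never releases anything, so the cache still ends up holding the whole file. Capping memory at roughly one row group would also need the cache to drop data once it has been read, which is what the issue below asks for.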
> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Major
> Labels: good-second-issue
>
> I've added a unit test of the issue here:
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes
> those files are quite large (gigabytes). The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc., we do
> not release the data for row group X. The read range cache's entries vector
> still holds a pointer to the buffer. The data is not released until the file
> reader itself is destroyed, which only happens when we have finished
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single
> cache entry could be shared by multiple read ranges, so we will need
> some kind of reference counting to know when we have fully finished with an
> entry and can release it.
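
One possible shape for that reference counting, as a rough self-contained sketch. `ToyRangeCache`, `Range`, and `BytesHeld` are hypothetical names invented for illustration and are not the real ReadRangeCache interface. The sketch also assumes the caller knows up front how many requested ranges each coalesced entry serves, and it copies bytes out on read, whereas a real cache would hand out slices that share ownership of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical byte range within a file.
struct Range {
  int64_t offset;
  int64_t length;
};

// Toy cache: one coalesced entry may serve several requested ranges.
class ToyRangeCache {
 public:
  explicit ToyRangeCache(std::string file) : file_(std::move(file)) {}

  // Load one coalesced entry that will serve `num_requested` caller ranges.
  void Cache(Range coalesced, int num_requested) {
    assert(num_requested > 0);
    assert(coalesced.offset >= 0 && coalesced.length >= 0);
    assert(coalesced.offset + coalesced.length <=
           static_cast<int64_t>(file_.size()));
    auto buffer = std::make_shared<const std::string>(
        file_.substr(static_cast<std::size_t>(coalesced.offset),
                     static_cast<std::size_t>(coalesced.length)));
    entries_.push_back(Entry{coalesced, std::move(buffer), num_requested});
  }

  // Return the bytes for `r` and count one read against the covering entry.
  // Once every range the entry serves has been read, drop the entry's buffer.
  std::string Read(Range r) {
    for (Entry& e : entries_) {
      const bool covers =
          r.offset >= e.range.offset &&
          r.offset + r.length <= e.range.offset + e.range.length;
      if (e.buffer && covers) {
        std::string out =
            e.buffer->substr(static_cast<std::size_t>(r.offset - e.range.offset),
                             static_cast<std::size_t>(r.length));
        if (--e.pending_reads == 0) e.buffer.reset();  // last reader: release
        return out;
      }
    }
    return {};  // not cached, or already released
  }

  // Total bytes the cache itself is still holding on to.
  std::size_t BytesHeld() const {
    std::size_t total = 0;
    for (const Entry& e : entries_) {
      if (e.buffer) total += e.buffer->size();
    }
    return total;
  }

 private:
  struct Entry {
    Range range;
    std::shared_ptr<const std::string> buffer;
    int pending_reads;  // requested ranges that have not been read yet
  };

  std::string file_;
  std::vector<Entry> entries_;
};
```

For example, caching bytes 0 to 6 on behalf of two requested ranges keeps the entry alive after the first `Read` and drops it after the second. A count-per-entry like this breaks down if a range is read twice or never read, so a real implementation would need to decide how to handle those cases.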