[
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641061#comment-17641061
]
Percy Camilo Triveño Aucahuasi commented on ARROW-17599:
--------------------------------------------------------
Now that we have
[RandomAccessFile::ReadManyAsync|https://github.com/apache/arrow/pull/14723], I
would like to start gathering ideas on how to implement the fix for this ticket
using that new capability.
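To frame the discussion, here is a rough sketch of what the per-row-group read path could look like on top of the new API. It assumes ReadManyAsync takes an IOContext plus a vector of ReadRange and returns one future per range (as in the linked PR); the ReadRowGroupRanges helper and the surrounding structure are hypothetical, not existing Arrow code.
{code:cpp}
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/future.h"

// Hypothetical helper: read all ranges for one row group on demand instead of
// caching them for the lifetime of the file reader. The ReadManyAsync
// signature below is assumed to match the linked PR (one
// Future<std::shared_ptr<Buffer>> per ReadRange).
arrow::Status ReadRowGroupRanges(
    const std::shared_ptr<arrow::io::RandomAccessFile>& file,
    const std::vector<arrow::io::ReadRange>& ranges) {
  // Issue all reads for this row group at once; the I/O layer is free to
  // coalesce adjacent ranges internally.
  std::vector<arrow::Future<std::shared_ptr<arrow::Buffer>>> futures =
      file->ReadManyAsync(arrow::io::IOContext(), ranges);

  for (auto& fut : futures) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, fut.result());
    // Hand `buf` to the consumer (e.g. the decoder). Once the last shared_ptr
    // goes out of scope the memory is released, so nothing is retained after
    // the row group has been processed.
    static_cast<void>(buf);
  }
  return arrow::Status::OK();
}
{code}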
In the [previous attempt|https://github.com/apache/arrow/pull/14226/files], we
discovered that the pre-buffering process doesn't handle concurrent use and
that a better long-term solution would be to separate caching and coalescing in
ReadRangeCache.
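To make the caching/coalescing separation more concrete, here is one possible shape for a reference-counted entry that is dropped as soon as the last range mapped onto it has been consumed. All of these names (CachedEntry, ReleasingCache, AddConsumer, MarkConsumed) are invented for illustration, and a real implementation would also need the synchronization that the current pre-buffering path lacks.
{code:cpp}
#include <cstdint>
#include <memory>
#include <unordered_map>

#include "arrow/buffer.h"

// Hypothetical cache entry: one coalesced buffer plus a count of the ranges
// that still need to be served from it.
struct CachedEntry {
  std::shared_ptr<arrow::Buffer> buffer;  // coalesced bytes, once available
  int remaining_consumers = 0;            // ranges not yet read back out
};

class ReleasingCache {
 public:
  // Register that one more range will be served from the entry keyed by
  // `entry_offset` (how entries are keyed is an open design question).
  void AddConsumer(int64_t entry_offset) {
    entries_[entry_offset].remaining_consumers++;
  }

  // Called after a range has been read and handed off downstream; when the
  // last consumer is done, the entry (and its buffer) is released instead of
  // being retained until the file reader is destroyed.
  void MarkConsumed(int64_t entry_offset) {
    auto it = entries_.find(entry_offset);
    if (it != entries_.end() && --it->second.remaining_consumers == 0) {
      entries_.erase(it);
    }
  }

 private:
  // A real version would guard this map with a mutex (or equivalent) to
  // handle the concurrent use that the current pre-buffering path mishandles.
  std::unordered_map<int64_t, CachedEntry> entries_;
};
{code}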
Currently, I think we still need to wait for this
[PR|https://github.com/apache/arrow/pull/14747] (and maybe another one too)
before starting work on this ticket, but it would be great to begin discussing
here how to use the new API to solve this issue.
> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Major
> Labels: good-second-issue, pull-request-available
> Time Spent: 4h
> Remaining Estimate: 0h
>
> I've added a unit test of the issue here:
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes
> those files are quite large (gigabytes). The usage is roughly:
> for X in num_row_groups:
>     CacheAllThePiecesWeNeedForRowGroupX
>     WaitForPiecesToArriveForRowGroupX
>     ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do
> not release the data for row group X. The read range cache's entries vector
> still holds a pointer to the buffer. The data is not released until the file
> reader itself is destroyed which only happens when we have finished
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single
> read range's cache entry could be shared by multiple ranges so we will need
> some kind of reference counting to know when we have fully finished with an
> entry and can release it.