Weston Pace created ARROW-17599:
-----------------------------------

             Summary: [C++] ReadRangeCache should not retain data after read
                 Key: ARROW-17599
                 URL: https://issues.apache.org/jira/browse/ARROW-17599
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


I've added a unit test demonstrating the issue here: 
https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention

We use the ReadRangeCache for pre-buffering IPC and Parquet files.  Sometimes 
those files are quite large (gigabytes).  The usage is roughly:

for X in num_row_groups:
  CacheAllThePiecesWeNeedForRowGroupX
  WaitForPiecesToArriveForRowGroupX
  ReadThePiecesWeNeedForRowGroupX

However, once we've read row group X and passed it on to Acero, etc., we do 
not release the data for row group X.  The read range cache's entries vector 
still holds a pointer to the buffer.  The data is not released until the file 
reader itself is destroyed, which only happens once we have finished 
processing the entire file.

This leads to excessive memory usage when pre-buffering is enabled.
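
To make the retention concrete, here is a minimal sketch of the pattern 
described above.  It does not use the real Arrow classes: ToyBuffer, ToyRange, 
ToyEntry, ToyRetainingCache and RetainedAfterScan are hypothetical stand-ins, 
and the "fetch" is just an allocation.  The only point is that a shared_ptr 
kept in the entries vector keeps every buffer alive after the consumer has 
dropped its own reference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow types.
struct ToyBuffer {
  std::vector<uint8_t> data;
};

struct ToyRange {
  int64_t offset;
  int64_t length;
};

struct ToyEntry {
  ToyRange range;
  std::shared_ptr<ToyBuffer> buffer;
};

class ToyRetainingCache {
 public:
  // Pretend to fetch the range and remember the result in the entries vector.
  void Cache(ToyRange range) {
    auto buffer = std::make_shared<ToyBuffer>();
    buffer->data.resize(static_cast<std::size_t>(range.length));
    entries_.push_back({range, std::move(buffer)});
  }

  // Hand out the buffer for an exact range match.  The entry is left in the
  // vector, so the cache keeps its own reference to the data.
  std::shared_ptr<ToyBuffer> Read(ToyRange range) const {
    for (const auto& entry : entries_) {
      if (entry.range.offset == range.offset &&
          entry.range.length == range.length) {
        return entry.buffer;
      }
    }
    return nullptr;
  }

  // Total bytes still referenced by the cache itself.
  int64_t bytes_retained() const {
    int64_t total = 0;
    for (const auto& entry : entries_) {
      total += static_cast<int64_t>(entry.buffer->data.size());
    }
    return total;
  }

 private:
  std::vector<ToyEntry> entries_;
};

// Runs the cache/read loop over `num_row_groups` row groups of
// `bytes_per_group` bytes each and returns how many bytes the cache still
// holds after every row group has been consumed.
int64_t RetainedAfterScan(int num_row_groups, int64_t bytes_per_group) {
  ToyRetainingCache cache;
  for (int x = 0; x < num_row_groups; ++x) {
    ToyRange range{x * bytes_per_group, bytes_per_group};
    cache.Cache(range);
    std::shared_ptr<ToyBuffer> piece = cache.Read(range);
    assert(piece != nullptr);
    // `piece` goes out of scope here, as if downstream had finished with it,
    // but the cache entry still references the same buffer.
  }
  return cache.bytes_retained();
}
```

In this toy model the retained size grows linearly with the number of row 
groups read, which mirrors the behaviour reported above.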

This could be a little difficult to implement because a single cache entry 
could be shared by multiple read ranges, so we will need some kind of 
reference counting to know when we have fully finished with an entry and can 
release it.
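
One possible shape for that reference counting is sketched below.  This is 
not a proposed patch and does not use the real ReadRangeCache API: 
SketchBuffer, CountedEntry and ReleasingCache are hypothetical names.  It 
assumes that the number of requested ranges served by each entry is known 
when the entry is created, and that each requested range is read exactly 
once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the Arrow implementation.
struct SketchBuffer {
  std::vector<uint8_t> data;
};

// One fetched entry that may serve several requested ranges.
struct CountedEntry {
  int64_t offset;
  int64_t length;
  int pending_reads;  // requested ranges that have not been read yet
  std::shared_ptr<SketchBuffer> buffer;
};

class ReleasingCache {
 public:
  // Register an entry covering [offset, offset + length) that `num_ranges`
  // requested ranges will read from (assumed known up front).
  void Cache(int64_t offset, int64_t length, int num_ranges) {
    assert(num_ranges > 0);
    auto buffer = std::make_shared<SketchBuffer>();
    buffer->data.resize(static_cast<std::size_t>(length));
    entries_.push_back({offset, length, num_ranges, std::move(buffer)});
  }

  // Read a range contained in a live entry.  For brevity this returns the
  // whole entry buffer; a real implementation would return a slice of it.
  std::shared_ptr<SketchBuffer> Read(int64_t offset, int64_t length) {
    for (auto& entry : entries_) {
      const bool contained = offset >= entry.offset &&
                             offset + length <= entry.offset + entry.length;
      if (contained && entry.buffer != nullptr) {
        std::shared_ptr<SketchBuffer> result = entry.buffer;
        if (--entry.pending_reads == 0) {
          // Last expected read: the cache drops its reference, so the data
          // is freed as soon as the callers are done with it.
          entry.buffer.reset();
        }
        return result;
      }
    }
    return nullptr;
  }

  // Total bytes still referenced by the cache itself.
  int64_t bytes_retained() const {
    int64_t total = 0;
    for (const auto& entry : entries_) {
      if (entry.buffer != nullptr) {
        total += static_cast<int64_t>(entry.buffer->data.size());
      }
    }
    return total;
  }

 private:
  std::vector<CountedEntry> entries_;
};
```

With this shape, an entry shared by two ranges stays cached after the first 
read and is dropped by the cache on the second; the buffer itself lives on 
only as long as the consumers hold it.  The sketch leaves the spent entry in 
the vector and ignores thread safety, both of which a real change would have 
to handle.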



--
This message was sent by Atlassian Jira
(v8.20.10#820010)