[ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602893#comment-17602893
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-17599:
--------------------------------------------------------

Should 
[ReadRangeCache::read|https://github.com/westonpace/arrow/blob/5cf0deaf82f090718350fd3d8d0ee5c4795df7a0/cpp/src/arrow/io/caching.cc#L193]
 remove the cache entry after performing the read?

From the documentation, it is not clear whether that method should remove the 
cache entry. I ran a simple experiment (erasing the entry after the read) and 
the [unit 
test|https://github.com/westonpace/arrow/blob/5cf0deaf82f090718350fd3d8d0ee5c4795df7a0/cpp/src/arrow/io/memory_test.cc#L772]
 provided by [~westonpace] still passes:

 
{code:java}
if (it != entries.end() && it->range.Contains(range)) {
  ...
  // Experiment: drop the cache entry once it has been read.
  entries.erase(it);
  ...
}
{code}
 

This is just an experiment to better understand the issue.
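
To make the experiment above concrete, here is a minimal standalone sketch of the same idea. It does not use the Arrow types: `Range`, `Buffer`, `Entry` and `ReadAndRelease` are hypothetical stand-ins, and the entries are assumed to be sorted by offset and non-overlapping. It only illustrates that erasing the entry drops the cache's reference, so the memory is freed as soon as the caller lets go of the returned buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the Arrow types.
struct Range {
  int64_t offset;
  int64_t length;
  bool Contains(const Range& other) const {
    return offset <= other.offset &&
           other.offset + other.length <= offset + length;
  }
};

using Buffer = std::vector<uint8_t>;

struct Entry {
  Range range;
  std::shared_ptr<Buffer> data;
};

// Assumes `entries` is sorted by offset and the ranges do not overlap.
// Returns the cached buffer covering `range` and erases its entry,
// or returns nullptr if no entry covers the range.
std::shared_ptr<Buffer> ReadAndRelease(std::vector<Entry>& entries,
                                       const Range& range) {
  // Find the first entry whose end is at or past the end of `range`.
  auto it = std::lower_bound(
      entries.begin(), entries.end(), range,
      [](const Entry& entry, const Range& r) {
        return entry.range.offset + entry.range.length < r.offset + r.length;
      });
  if (it != entries.end() && it->range.Contains(range)) {
    std::shared_ptr<Buffer> data = it->data;  // keep the data alive for the caller
    entries.erase(it);                        // the cache no longer references it
    return data;
  }
  return nullptr;
}
```

One consequence of this sketch: a second read that falls inside the same entry returns nullptr, because the entry is already gone. That seems to be the shared-entry concern raised in the issue description.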

Also, I tried to explore [~lidavidm]'s idea, but I think I need more hints 
about how we can store each cache entry as a custom buffer. So far, my 
understanding is that the data is wrapped and held by the RandomAccessFile, 
which is why the release won't happen until the file reader is destroyed 
(there is no way to access the internal data buffer held by 
RandomAccessFile).

Weston, it would be great to know the full use case you were running. Right 
now I'm using the unit test, but it would help to replicate the issue locally 
with the full use case (maybe the use case needs an override method for 
ReadRangeCache::read that can delete the range at the end).

> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
>                 Key: ARROW-17599
>                 URL: https://issues.apache.org/jira/browse/ARROW-17599
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>              Labels: good-second-issue
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.
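
The reference counting mentioned in the issue description could look roughly like the sketch below. This is only an illustration under stated assumptions: `ByteRange`, `Bytes`, `CountedEntry` and `CountingCache` are hypothetical names, not Arrow APIs, and the sketch assumes the cache knows up front how many requested ranges each coalesced entry serves.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the Arrow implementation.
struct ByteRange {
  int64_t offset;
  int64_t length;
  bool Contains(const ByteRange& other) const {
    return offset <= other.offset &&
           other.offset + other.length <= offset + length;
  }
};

using Bytes = std::vector<uint8_t>;

struct CountedEntry {
  ByteRange range;
  std::shared_ptr<Bytes> data;
  int pending_reads;  // requested ranges served by this entry and not yet read
};

class CountingCache {
 public:
  // Registers a coalesced entry that serves `num_requests` requested ranges.
  void Add(ByteRange range, std::shared_ptr<Bytes> data, int num_requests) {
    entries_.push_back({range, std::move(data), num_requests});
  }

  // Returns the buffer covering `range`, or nullptr on a miss.
  // The entry is erased once its last expected read has been served.
  std::shared_ptr<Bytes> Read(const ByteRange& range) {
    for (auto it = entries_.begin(); it != entries_.end(); ++it) {
      if (!it->range.Contains(range)) continue;
      std::shared_ptr<Bytes> data = it->data;
      if (--it->pending_reads == 0) entries_.erase(it);
      return data;
    }
    return nullptr;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  std::vector<CountedEntry> entries_;
};
```

With this shape, an entry shared by two ranges survives the first read and is dropped after the second. The memory itself is freed once the readers release their own references.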



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
