[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603200#comment-17603200 ]

Weston Pace commented on ARROW-17599:
-------------------------------------

> Should ReadRangeCache::read remove the cache entry after performing the read?

Yes.  I don't think this is mentioned in the documentation.  It may not have 
been a concern at the time.  I think we should also update the documentation so 
that it is very clear that the entry is removed once it has been read.

> Also, I tried to explore David Li's idea, but I think I need more hints about 
> how we can store each cache entry as a custom buffer; so far what I 
> understand is that the data is being wrapped/eaten by the RandomAccessFile 
> and that is the reason why the release won't happen until the file reader is 
> destroyed (there is no way to access the internal data buffer held by the 
> RandomAccessFile)

The FileReader owns a single instance of ReadRangeCache.  That instance won't 
be deleted until the FileReader is deleted.
The ReadRangeCache has a vector of RangeCacheEntry.  Currently, nothing removes 
items from that vector.  A RangeCacheEntry has a 
{{Future<std::shared_ptr<Buffer>>}}.  Once that future has been filled, it 
holds the result (in case callbacks are added later), and so it keeps the 
buffer alive (because there is still a shared_ptr referencing it).
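
To make that chain of ownership concrete, here is a minimal sketch using 
simplified stand-in types (these are not the real Arrow classes, just a model 
of the relationship) showing why a completed future keeps its buffer alive for 
as long as the cache exists:

{code:cpp}
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// Simplified stand-ins for arrow::Buffer, a completed Future, and the cache.
struct Buffer { std::vector<uint8_t> data; };

// A filled future keeps its result around so later callbacks can still see it.
struct FilledFuture { std::shared_ptr<Buffer> result; };

struct RangeCacheEntry {
  int64_t offset;
  int64_t length;
  FilledFuture future;  // holds a shared_ptr<Buffer> once the read completes
};

struct ReadRangeCache {
  std::vector<RangeCacheEntry> entries;  // nothing ever erases from this

  std::shared_ptr<Buffer> Read(int64_t offset) {
    for (auto& entry : entries) {
      if (entry.offset == offset) return entry.future.result;  // entry stays
    }
    return nullptr;
  }
};

int main() {
  ReadRangeCache cache;  // in Arrow this lives inside the FileReader
  cache.entries.push_back({0, 4, FilledFuture{std::make_shared<Buffer>()}});

  auto buf = cache.Read(0);
  std::cout << "use_count after read: " << buf.use_count() << "\n";  // prints 2
  buf.reset();  // the caller is done with this range...
  // ...but the entry's future still references the buffer, so the memory is
  // not released until `cache` (i.e. the FileReader) is destroyed.
  std::cout << "still cached: " << (cache.entries[0].future.result != nullptr)
            << "\n";  // prints 1
}
{code}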

> Weston, it would be great to know the full use case you were running; right 
> now I'm using the unit test, but it would help to replicate the issue with 
> the full use case locally (maybe the use case needs an override method for 
> ReadRangeCache::read that can delete the range at the end)

The use case is described in more detail in 
https://issues.apache.org/jira/browse/ARROW-17590 (which has a reproducing 
script) but a slightly more involved test would be:

Create a 4GiB parquet file with 20 row groups.  Each row group should be about 
200MiB.  Scan the file with pyarrow to_batches (just count the rows or 
something).  The scanner should only read in at most 2 row groups at a time.  
So I'd expect to see around 0.5GiB peak RAM.  However, in practice, you will 
see 4GiB peak RAM.
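
For reference, a row-group-at-a-time read with pre-buffering on looks roughly 
like the sketch below against the C++ API.  This is only a hedged 
approximation of the pyarrow to_batches path (whether ReadRowGroup itself goes 
through the pre-buffer cache is an assumption here), but the allocator's 
high-water mark should show the same retention:

{code:cpp}
#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Sketch: scan a large Parquet file one row group at a time with
// pre-buffering enabled, then print the allocator's high-water mark.
// With the current cache behavior the peak approaches the full file size
// instead of a couple of row groups.
arrow::Status ScanWithPreBuffer(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(true);  // reads go through the ReadRangeCache

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.properties(arrow_props)->Build(&reader));

  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
    // `table` is dropped here, but the raw column-chunk bytes read for this
    // row group stay referenced by the reader's ReadRangeCache until
    // `reader` itself is destroyed.
  }

  std::cout << "peak allocation: "
            << arrow::default_memory_pool()->max_memory() << " bytes\n";
  return arrow::Status::OK();
}

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: scan_with_pre_buffer <file.parquet>\n";
    return 1;
  }
  auto status = ScanWithPreBuffer(argv[1]);
  if (!status.ok()) {
    std::cerr << status.ToString() << "\n";
    return 1;
  }
  return 0;
}
{code}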

> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
>                 Key: ARROW-17599
>                 URL: https://issues.apache.org/jira/browse/ARROW-17599
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>              Labels: good-second-issue
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges, so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.
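
Purely to illustrate the reference-counting idea mentioned in the description 
above (not a committed design; the names and layout here are made up), an 
entry could track how many cached ranges still map onto it and drop its buffer 
once the last one has been consumed:

{code:cpp}
#include <cstdint>
#include <memory>
#include <vector>

// Illustrative only: hypothetical names, not the actual Arrow API.
struct Buffer { std::vector<uint8_t> data; };

struct RefCountedEntry {
  int64_t offset = 0;
  int64_t length = 0;
  int remaining_reads = 0;         // how many cached ranges still need this entry
  std::shared_ptr<Buffer> buffer;  // filled when the underlying read completes

  // Hand the data to the caller; release our own reference after the last read.
  std::shared_ptr<Buffer> Consume() {
    std::shared_ptr<Buffer> result = buffer;
    if (--remaining_reads <= 0) {
      buffer.reset();  // last consumer: the entry no longer pins the memory
    }
    return result;
  }
};

int main() {
  RefCountedEntry entry;
  entry.remaining_reads = 2;
  entry.buffer = std::make_shared<Buffer>();
  auto a = entry.Consume();  // one read left, entry still holds the buffer
  auto b = entry.Consume();  // last read: entry drops its reference
  return entry.buffer == nullptr ? 0 : 1;  // exits 0: memory no longer pinned
}
{code}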



