[
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607252#comment-17607252
]
Percy Camilo Triveño Aucahuasi commented on ARROW-17599:
--------------------------------------------------------
Thanks Weston,
>> Should ReadRangeCache::read remove the cache entry after performing the read?
>Yes. I don't think this is mentioned in the documentation. It may not have
>been a concern at the time. I think we should also update the documentation so
>that we are very clear that this happens.
It seems that
[ParquetFileReader::PreBuffer|https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/parquet/file_reader.h#L156]
was implemented under a different assumption. From the API docs:
_"After calling this, creating readers for row groups/column indices that were
not buffered may fail. {*}Creating multiple readers for the a subset of the
buffered regions is acceptable{*}. This may be called again to buffer a
different set of row groups/columns."_
I did run the script provided in ARROW-17590 and was able to reproduce the
issue.
I was also able to confirm that we read the same cache entry multiple times, and that removing the entry after ReadRangeCache::read breaks the contract required by ParquetFileReader::PreBuffer.
I'll keep investigating, any other ideas are more than welcome!
> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Major
> Labels: good-second-issue
>
> I've added a unit test of the issue here:
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes
> those files are quite large (gigabytes). The usage is roughly:
> for X in num_row_groups:
>     CacheAllThePiecesWeNeedForRowGroupX
>     WaitForPiecesToArriveForRowGroupX
>     ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do
> not release the data for row group X. The read range cache's entries vector
> still holds a pointer to the buffer. The data is not released until the file
> reader itself is destroyed which only happens when we have finished
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single
> read range's cache entry could be shared by multiple ranges so we will need
> some kind of reference counting to know when we have fully finished with an
> entry and can release it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)