[https://issues.apache.org/jira/browse/ARROW-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623926#comment-17623926]
David Li commented on ARROW-18160:
----------------------------------
The only way (without sub-group reads) is copying, right?
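For reference, a minimal sketch of what that copying could look like with the
Arrow C++ API. It assumes {{arrow::Concatenate}} materializes fresh buffers even
for a single sliced input (true of the current implementation, though not a
documented guarantee), and {{CopySlice}} is an illustrative helper name, not an
Arrow API:
{noformat}
#include <memory>
#include <utility>
#include <vector>

#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Hypothetical helper: deep-copies a sliced batch into freshly allocated
// buffers so it no longer pins the parent row group's memory.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> CopySlice(
    const std::shared_ptr<arrow::RecordBatch>& slice,
    arrow::MemoryPool* pool = arrow::default_memory_pool()) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(slice->num_columns());
  for (const auto& column : slice->columns()) {
    // Concatenate writes the sliced range into new buffers, dropping the
    // reference to the parent allocation.
    ARROW_ASSIGN_OR_RAISE(auto copied, arrow::Concatenate({column}, pool));
    columns.push_back(std::move(copied));
  }
  return arrow::RecordBatch::Make(slice->schema(), slice->num_rows(),
                                  std::move(columns));
}
{noformat}
The trade-off is one extra copy per batch, but resident memory is then bounded
by roughly one batch rather than the whole row group.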
> [C++] Scanner slicing large row groups leads to inefficient RAM usage
> ---------------------------------------------------------------------
>
> Key: ARROW-18160
> URL: https://issues.apache.org/jira/browse/ARROW-18160
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> As an example, consider a 4GB Parquet file with one giant row group. At the
> moment it is inevitable that we read this in as a single 4GB record batch
> (there are other JIRAs for sub-row-group reads which, if implemented, would
> obsolete this one).
> We then slice off pieces of that 4GB record batch for processing:
> {noformat}
> next_batch = current.slice(0, batch_size)
> current = current.slice(batch_size)
> {noformat}
> However, even though {{current}} shrinks on each iteration, it still references
> the entire 4GB of data (a slice is a zero-copy view that keeps the parent
> buffers alive, so no memory is released until every slice is dropped). We may
> want to investigate alternative strategies here so that memory can be freed as
> soon as we are done processing it.
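> A minimal standalone sketch illustrating this sharing with the Arrow C++ API
> (for a primitive array, {{buffers[1]}} is the values buffer):
> {noformat}
> #include <iostream>
> #include <memory>
>
> #include <arrow/api.h>
>
> int main() {
>   arrow::Int64Builder builder;
>   arrow::Status st = builder.AppendValues({1, 2, 3, 4, 5, 6, 7, 8});
>   if (!st.ok()) return 1;
>   std::shared_ptr<arrow::Array> parent = builder.Finish().ValueOrDie();
>
>   // Slice(offset, length) is zero-copy: same buffers, new offset/length.
>   std::shared_ptr<arrow::Array> slice = parent->Slice(4, 4);
>
>   // Prints 1: both arrays share the same 8-value buffer, so dropping
>   // parent alone frees nothing while slice is alive.
>   std::cout << (slice->data()->buffers[1] == parent->data()->buffers[1])
>             << std::endl;
>   return 0;
> }
> {noformat}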