[https://issues.apache.org/jira/browse/ARROW-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623926#comment-17623926]
David Li commented on ARROW-18160:
----------------------------------
The only way (without sub-group reads) is copying, right?
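For reference, a minimal sketch of what that copying could look like with the
Arrow C++ API. It assumes {{arrow::Concatenate}} materializes fresh buffers even
for a single sliced input (true of the current implementation, though not a
documented guarantee), and {{CopySlice}} is an illustrative helper name, not an
Arrow API:
{noformat}
#include <memory>
#include <utility>
#include <vector>

#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Hypothetical helper: deep-copies a sliced batch into freshly allocated
// buffers so it no longer pins the parent row group's memory.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> CopySlice(
    const std::shared_ptr<arrow::RecordBatch>& slice,
    arrow::MemoryPool* pool = arrow::default_memory_pool()) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(slice->num_columns());
  for (const auto& column : slice->columns()) {
    // Concatenate writes the sliced range into new buffers, dropping the
    // reference to the parent allocation.
    ARROW_ASSIGN_OR_RAISE(auto copied, arrow::Concatenate({column}, pool));
    columns.push_back(std::move(copied));
  }
  return arrow::RecordBatch::Make(slice->schema(), slice->num_rows(),
                                  std::move(columns));
}
{noformat}
The trade-off is one extra copy per batch, but resident memory is then bounded
by roughly one batch rather than the whole row group.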
> [C++] Scanner slicing large row groups leads to inefficient RAM usage
> ---------------------------------------------------------------------
>
> Key: ARROW-18160
> URL: https://issues.apache.org/jira/browse/ARROW-18160
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> As an example, consider a 4GB Parquet file with one giant row group. At the
> moment it is inevitable that we read this in as a single 4GB record batch
> (there are other JIRAs for sub-row-group reads which, if implemented, would
> obsolete this one).
> We then slice off pieces of that 4GB record batch for processing:
> {noformat}
> next_batch = current.slice(0, batch_size)
> current = current.slice(batch_size)
> {noformat}
> However, even though {{current}} shrinks on each iteration, it still references
> the entire 4GB of data (a slice is a zero-copy view that keeps the parent
> buffers alive, so no memory is released until every slice is dropped). We may
> want to investigate alternative strategies here so that memory can be freed as
> soon as we are done processing it.
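> A minimal standalone sketch illustrating this sharing with the Arrow C++ API
> (for a primitive array, {{buffers[1]}} is the values buffer):
> {noformat}
> #include <iostream>
> #include <memory>
>
> #include <arrow/api.h>
>
> int main() {
>   arrow::Int64Builder builder;
>   arrow::Status st = builder.AppendValues({1, 2, 3, 4, 5, 6, 7, 8});
>   if (!st.ok()) return 1;
>   std::shared_ptr<arrow::Array> parent = builder.Finish().ValueOrDie();
>
>   // Slice(offset, length) is zero-copy: same buffers, new offset/length.
>   std::shared_ptr<arrow::Array> slice = parent->Slice(4, 4);
>
>   // Prints 1: both arrays share the same 8-value buffer, so dropping
>   // parent alone frees nothing while slice is alive.
>   std::cout << (slice->data()->buffers[1] == parent->data()->buffers[1])
>             << std::endl;
>   return 0;
> }
> {noformat}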