Weston Pace created ARROW-18160:
-----------------------------------
Summary: [C++] Scanner slicing large row groups leads to
inefficient RAM usage
Key: ARROW-18160
URL: https://issues.apache.org/jira/browse/ARROW-18160
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
As an example, consider a 4GB parquet file with 1 giant row group. At the
moment it is inevitable that we read this in as one large 4GB record batch
(there are other JIRAs for sub-row-group reads which, if implemented, would
obsolete this one).
We then slice off pieces of that 4GB record batch for processing:
{noformat}
next_batch = current.slice(0, batch_size)
current = current.slice(batch_size)
{noformat}
However, even though {{current}} shrinks on each iteration, slicing is
zero-copy: every slice keeps a reference to the full underlying buffers, so
none of the 4GB allocation can be freed until the last slice is dropped. We
may want to investigate alternative strategies here so that memory can be
released as soon as we are done processing each piece.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)