Weston Pace created ARROW-15410:
-----------------------------------
Summary: [C++][Datasets] Improve memory usage of datasets API when scanning parquet
Key: ARROW-15410
URL: https://issues.apache.org/jira/browse/ARROW-15410
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Assignee: Weston Pace
This is a more targeted fix to reduce memory usage when scanning Parquet
files. It is related to broader issues such as ARROW-14648, but those will
likely take longer to fix. The goal here is to make it possible to scan large
Parquet datasets with many files, where each file has reasonably sized row
groups (e.g. 1 million rows). Currently we run out of memory scanning a
configuration as simple as:
* 21 Parquet files
* Each Parquet file has 10 million rows split into row groups of 1 million rows