[ https://issues.apache.org/jira/browse/ARROW-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace resolved ARROW-15410.
---------------------------------
Fix Version/s: 8.0.0
Resolution: Fixed
Issue resolved by pull request 12228
[https://github.com/apache/arrow/pull/12228]
> [C++][Datasets] Improve memory usage of datasets API when scanning parquet
> --------------------------------------------------------------------------
>
> Key: ARROW-15410
> URL: https://issues.apache.org/jira/browse/ARROW-15410
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> This is a more targeted fix to improve memory usage when scanning Parquet
> files. It is related to broader issues like ARROW-14648, but those will
> likely take longer to fix. The goal here is to make it possible to scan
> large Parquet datasets with many files, where each file has reasonably sized
> row groups (e.g. 1 million rows). Currently we run out of memory scanning a
> configuration as simple as:
> * 21 Parquet files
> * Each Parquet file has 10 million rows, split into row groups of 1 million
>   rows each
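
For a rough sense of scale, the configuration above can be turned into row-group and byte counts. The file, row, and row-group numbers come from the issue description; the column count and value width are assumptions made purely for illustration (the issue does not describe the schema), so the byte figures are only a sketch of the gap between holding one row group in memory and holding the whole dataset.

```python
# Dataset shape taken from the issue description.
NUM_FILES = 21
ROWS_PER_FILE = 10_000_000
ROWS_PER_ROW_GROUP = 1_000_000

# ASSUMPTION (not from the issue): a hypothetical schema of 10 columns,
# each holding fixed-width 8-byte values once decoded.
ASSUMED_COLUMNS = 10
ASSUMED_BYTES_PER_VALUE = 8


def row_groups_per_file(rows_per_file: int, rows_per_row_group: int) -> int:
    """Number of row groups needed to hold a file's rows (ceiling division)."""
    return -(-rows_per_file // rows_per_row_group)


def decoded_bytes(rows: int) -> int:
    """Approximate decoded size of `rows` rows under the assumed schema."""
    return rows * ASSUMED_COLUMNS * ASSUMED_BYTES_PER_VALUE


groups_per_file = row_groups_per_file(ROWS_PER_FILE, ROWS_PER_ROW_GROUP)
total_row_groups = NUM_FILES * groups_per_file
total_rows = NUM_FILES * ROWS_PER_FILE
one_row_group_bytes = decoded_bytes(ROWS_PER_ROW_GROUP)
whole_dataset_bytes = decoded_bytes(total_rows)

print(f"row groups per file: {groups_per_file}")
print(f"total row groups:    {total_row_groups}")
print(f"total rows:          {total_rows:,}")
print(f"one row group:       {one_row_group_bytes / 1e6:.0f} MB (assumed schema)")
print(f"whole dataset:       {whole_dataset_bytes / 1e9:.1f} GB (assumed schema)")
```

Under these assumptions the two byte figures differ by the total number of row groups, so the peak memory of a scan depends heavily on how many row groups are buffered at once.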