[jira] [Created] (ARROW-15410) [C++][Datasets] Improve memory usage of datasets API when scanning parquet

Weston Pace (Jira) Fri, 21 Jan 2022 17:41:06 -0800

Weston Pace created ARROW-15410:
-----------------------------------

             Summary: [C++][Datasets] Improve memory usage of datasets API when scanning parquet
                 Key: ARROW-15410
                 URL: https://issues.apache.org/jira/browse/ARROW-15410
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
            Assignee: Weston Pace



This is a more targeted fix to improve memory usage when scanning parquet 
files.  It is related to broader issues like ARROW-14648, but those will likely 
take longer to fix.  The goal here is to make it possible to scan large parquet 
datasets with many files, where each file has reasonably sized row groups (e.g. 
1 million rows).  Currently, we run out of memory when scanning a configuration 
as simple as:

 * 21 parquet files
 * Each parquet file has 10 million rows, split into row groups of 1 million rows
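
To make the scale of that configuration concrete, here is a minimal sketch that only does the arithmetic on the numbers given above. The `DatasetLayout` class and its field names are hypothetical, made up for illustration, and are not part of the Arrow API. The issue does not state the schema or column widths, so no byte estimate is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetLayout:
    """Hypothetical description of a dataset's shape (not an Arrow type)."""

    num_files: int
    rows_per_file: int
    rows_per_row_group: int

    @property
    def row_groups_per_file(self) -> int:
        # Ceiling division: a final, partially filled row group still counts.
        return -(-self.rows_per_file // self.rows_per_row_group)

    @property
    def total_row_groups(self) -> int:
        return self.num_files * self.row_groups_per_file

    @property
    def total_rows(self) -> int:
        return self.num_files * self.rows_per_file


# The configuration from the issue description.
layout = DatasetLayout(
    num_files=21,
    rows_per_file=10_000_000,
    rows_per_row_group=1_000_000,
)

print(layout.row_groups_per_file)  # 10
print(layout.total_row_groups)     # 210
print(layout.total_rows)           # 210000000
```

So the failing case is 210 row groups holding 210 million rows in total, which is a modest dataset by the standards the issue is aiming for.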



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
