Weston Pace created ARROW-15410:
-----------------------------------

             Summary: [C++][Datasets] Improve memory usage of datasets API when scanning parquet
                 Key: ARROW-15410
                 URL: https://issues.apache.org/jira/browse/ARROW-15410
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
            Assignee: Weston Pace


This is a more targeted fix to reduce memory usage when scanning Parquet 
files.  It is related to broader issues such as ARROW-14648, but those will 
likely take longer to fix.  The goal here is to make it possible to scan large 
Parquet datasets with many files, where each file has reasonably sized row 
groups (e.g. 1 million rows per row group).  Currently we run out of memory 
scanning a configuration as simple as:

21 Parquet files
Each Parquet file has 10 million rows, split into row groups of 1 million rows
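
To put that configuration in numbers, here is a rough sketch of the layout 
arithmetic.  The file, row and row-group counts come from the description 
above; the class name, the column count and the bytes-per-value figure are 
hypothetical and only there to give a ballpark, since the issue does not say 
what the schema looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanLayout:
    """Layout of the dataset described in the issue (defaults match it)."""

    num_files: int = 21
    rows_per_file: int = 10_000_000
    rows_per_row_group: int = 1_000_000

    @property
    def row_groups_per_file(self) -> int:
        # Ceiling division: a trailing partial row group still counts as one.
        return -(-self.rows_per_file // self.rows_per_row_group)

    @property
    def total_row_groups(self) -> int:
        return self.num_files * self.row_groups_per_file

    @property
    def total_rows(self) -> int:
        return self.num_files * self.rows_per_file

    def row_group_bytes(self, num_columns: int, bytes_per_value: int) -> int:
        # ASSUMPTION: fixed-width columns, no nulls, no dictionary encoding.
        return self.rows_per_row_group * num_columns * bytes_per_value

    def dataset_bytes(self, num_columns: int, bytes_per_value: int) -> int:
        # Same assumption as above, applied to every row in every file.
        return self.total_rows * num_columns * bytes_per_value


layout = ScanLayout()
print(layout.row_groups_per_file)  # row groups in each file
print(layout.total_row_groups)     # row groups across the dataset
print(layout.total_rows)           # rows across the dataset

# Hypothetical schema: 10 columns of 8-byte values.
print(layout.row_group_bytes(num_columns=10, bytes_per_value=8))
print(layout.dataset_bytes(num_columns=10, bytes_per_value=8))
```

Under those assumed column figures a single decoded row group is a small 
fraction of the whole dataset, so peak memory depends mostly on how many row 
groups are held in memory at the same time.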



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
