[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin closed ARROW-17590.
-----------------------
Resolution: Duplicate
> Lower memory usage with filters
> -------------------------------
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yin
> Priority: Major
> Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a parquet file (about 23MB, with 250K rows and 600 object/string
> columns containing lots of None) with a filter on a non-null column selecting a
> small number of rows (e.g. 1 to 500), the memory usage is pretty high (around
> 900MB to 1GB). The resulting table and dataframe have only a few rows (1 row is
> about 20KB, 500 rows about 20MB). It looks like many rows are scanned/loaded from
> the parquet file. Not only is the memory footprint/high-water mark high, but the
> memory also does not seem to be released in time (e.g. after GC in Python),
> though it may get reused for subsequent reads.
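> A minimal sketch of that filtered read (the file path, column name, and filter
> value here are hypothetical placeholders, not the actual data):
>   import pyarrow.parquet as pq
>   # Filter on a non-null column; only a handful of rows match, yet memory
>   # usage climbs to roughly 900MB-1GB during the read.
>   table = pq.read_table("data.parquet",
>                         filters=[("some_col", "==", "some_value")])
>   df = table.to_pandas()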
> When reading the same parquet file for all columns without filtering, the
> memory usage is about the same at 900MB. It goes up to 2.3GB after converting to
> a pandas dataframe with to_pandas. df.info(memory_usage='deep') reports 4.3GB,
> which may be double-counting something.
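> For reference, the unfiltered read and the memory report above were along these
> lines (a sketch; the path is a placeholder):
>   import pyarrow.parquet as pq
>   table = pq.read_table("data.parquet")    # ~900MB resident after this
>   df = table.to_pandas()                    # ~2.3GB after conversion
>   df.info(memory_usage='deep')              # reports about 4.3GB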
> It helps to limit the number of columns read. Reading 1 column, with a filter for
> 1 or more rows or without a filter, takes about 10MB, which is much smaller and
> better, but still bigger than the size of a table or dataframe with 1 or 500 rows
> of 1 column (under 1MB).
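> Restricting the read to a single column looks roughly like this (column names are
> placeholders):
>   import pyarrow.parquet as pq
>   # Reading one column, with or without the filter, takes only about 10MB.
>   table = pq.read_table("data.parquet", columns=["col_a"],
>                         filters=[("some_col", "==", "some_value")])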
> The filtered column is not a partition key; the filter still works functionally
> and returns the correct rows. But the memory usage is quite high even though the
> parquet file is not really large, whether it is partitioned or not. There are
> some existing references to similar issues, for example:
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0):
> _ParquetDatasetV2.read:
>   self._dataset.to_table(columns=columns, filter=self._filter_expression,
>                          use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I also played with pyarrow._dataset.Scanner.to_table:
>   self._dataset.scanner(columns=columns,
>                         filter=self._filter_expression).to_table()
> The memory usage is small when constructing the scanner, but goes up once the
> to_table call materializes it.
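> A sketch of that dataset/scanner path (the path and filter expression are
> placeholders):
>   import pyarrow.dataset as ds
>   dataset = ds.dataset("data.parquet", format="parquet")
>   # Building the scanner itself uses little memory...
>   scanner = dataset.scanner(filter=ds.field("some_col") == "some_value")
>   # ...memory climbs here, when the scan is materialized into a table.
>   table = scanner.to_table()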
> Is there some way or workaround to reduce the memory usage with read
> filtering?
> If not supported yet, can it be fixed/improved with priority?
> This is a blocking issue for us when we need to load all or many columns.
> I am not sure what improvement is possible given how the parquet columnar format
> works, and whether it can be patched in the PyArrow Python code or requires
> changing and rebuilding the Arrow C++ code.
> Thanks!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)