Weston Pace created ARROW-16451:
-----------------------------------
Summary: [C++] ParquetFileFragment caches parquet file metadata
and there is no way to disable this
Key: ARROW-16451
URL: https://issues.apache.org/jira/browse/ARROW-16451
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
While investigating ARROW-15081 we saw an unexpectedly large amount of memory in
use, even though all of the results were being accumulated into a single 64 byte
counter (e.g. {{SELECT COUNT(*) FROM table}}).
It turns out this was the parquet metadata, which gets attached to the parquet
file fragment and cached there. There is currently no way to prevent this and,
in this case, it used quite a bit of RAM: there were 1100 files and each file
had ~10MB of metadata, so roughly 11GB of metadata in total.
We should add an option for disabling this caching. The caching should probably
also be off by default: it is useful if you are going to scan the same dataset
again and again, but otherwise it is just wasted RAM.
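
To make the proposal concrete, here is a minimal, self-contained sketch of the
behavior being asked for. It is a toy model, not Arrow code: {{ToyFragment}},
{{FileMetadata}}, {{ScanOptions}}, the {{cache_metadata}} flag and {{ScanAll}}
are all hypothetical names made up for illustration, and the real option might
look quite different.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parsed metadata of one parquet file.
struct FileMetadata {
  std::vector<char> bytes;
};

// Hypothetical scan option; this is NOT an existing Arrow setting.
struct ScanOptions {
  bool cache_metadata = false;  // proposed default: do not retain metadata
};

// Toy fragment, one per file in the dataset.
class ToyFragment {
 public:
  ToyFragment(std::string path, std::size_t metadata_size)
      : path_(std::move(path)), metadata_size_(metadata_size) {}

  const std::string& path() const { return path_; }

  // Pretend to scan the file: obtain the metadata (reusing a cached copy if
  // there is one), use it, and retain it only if the caller asked for that.
  void Scan(const ScanOptions& options) {
    std::shared_ptr<FileMetadata> metadata = cached_;
    if (!metadata) {
      metadata = std::make_shared<FileMetadata>();
      metadata->bytes.resize(metadata_size_);  // simulate loading the footer
    }
    // ... row groups would be read here using `metadata` ...
    if (options.cache_metadata) {
      cached_ = metadata;  // keep it for the next scan of the same dataset
    } else {
      cached_.reset();  // release it once this scan is done
    }
  }

  // Number of metadata bytes this fragment is still holding on to.
  std::size_t RetainedBytes() const {
    return cached_ ? cached_->bytes.size() : 0;
  }

 private:
  std::string path_;
  std::size_t metadata_size_;
  std::shared_ptr<FileMetadata> cached_;
};

// Scan every fragment and report how much metadata stays resident afterwards.
std::size_t ScanAll(std::vector<ToyFragment>& fragments,
                    const ScanOptions& options) {
  std::size_t retained = 0;
  for (auto& fragment : fragments) {
    fragment.Scan(options);
    retained += fragment.RetainedBytes();
  }
  return retained;
}
```

With caching enabled, the retained size grows with the number of files (1100
files at ~10MB each is the ~11GB case described above). With it disabled,
nothing is retained after the scan, at the cost of reloading the metadata if
the same dataset is scanned again.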