adamreeve opened a new issue, #46971:
URL: https://github.com/apache/arrow/issues/46971

   ### Describe the enhancement requested
   
   I've been looking into memory usage when reading large (wide) Parquet files 
with the Arrow based API. Two configuration options that greatly reduce memory 
use are enabling a buffered stream and disabling pre-buffering (see 
https://github.com/apache/arrow/issues/46935).
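
   For context, here is a minimal sketch of how those two options are set via the C++ reader properties (the helper function names are just for illustration):

```cpp
#include "parquet/properties.h"

// Low-level reader properties: read data pages through a buffered stream
// rather than materializing whole column chunks in memory at once.
parquet::ReaderProperties MakeReaderProperties() {
  parquet::ReaderProperties props;
  props.enable_buffered_stream();
  return props;
}

// Arrow-level reader properties: disable pre-buffering of column chunk byte ranges.
parquet::ArrowReaderProperties MakeArrowReaderProperties() {
  parquet::ArrowReaderProperties props;
  props.set_pre_buffer(false);
  return props;
}
```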
   
   When reading encrypted Parquet files, the [decryption buffers](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/parquet/column_reader.cc#L308) are another significant contributor to memory use.
 If data pages are compressed as well as encrypted, data is first decrypted 
into this buffer and then decompressed. The decrypted data is no longer needed 
after decompression, but the memory is held by each `SerializedPageReader` (one 
per encrypted column).
   
   I've tested changing these decryption buffers to be temporary, so that their 
memory can be returned to the memory pool after page decompression. This 
significantly reduces memory usage in my test case.
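
   As an illustration of the change (a simplified sketch only, not the actual `SerializedPageReader` code; `DecryptAndDecompressPage` and its parameters are hypothetical), the decryption buffer can be allocated locally so it is handed back to the pool once the page has been decompressed:

```cpp
#include <memory>

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Simplified sketch: decrypt a page into a temporary buffer, decompress it,
// and let the temporary buffer be released back to the pool when it goes out
// of scope, instead of holding it as a long-lived member of the page reader.
arrow::Result<std::shared_ptr<arrow::Buffer>> DecryptAndDecompressPage(
    const arrow::Buffer& encrypted_page, int64_t decrypted_size,
    int64_t uncompressed_size, arrow::MemoryPool* pool) {
  // Temporary decryption buffer, scoped to this call only.
  ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::ResizableBuffer> decryption_buffer,
                        arrow::AllocateResizableBuffer(decrypted_size, pool));

  // ... decrypt encrypted_page into decryption_buffer (elided) ...

  // Decompress from the temporary buffer into a separate output buffer.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> decompressed,
                        arrow::AllocateBuffer(uncompressed_size, pool));
  // ... decompress decryption_buffer into decompressed (elided) ...

  // decryption_buffer goes out of scope here, so its memory is returned to
  // the pool rather than being held until the next page is read.
  return decompressed;
}
```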
   
   My test case uses the following (a rough sketch of the read code follows this list):
   * 12 GB parquet file with 453 columns (451 are float32, one is an int32 date 
and one is an int32 id)
   * All columns encrypted with one key
   * 6 row groups, with each row group except the last having 1,048,576 rows
   * Read two of the row groups as Arrow record batches
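
   Here is that sketch, assuming the Arrow C++ `parquet::arrow::FileReader` API; the decryption key, file path and row group indices are placeholders, and `ReadTwoRowGroups` is just an illustrative name:

```cpp
#include <memory>
#include <vector>

#include "arrow/record_batch.h"
#include "parquet/arrow/reader.h"
#include "parquet/encryption/encryption.h"
#include "parquet/exception.h"
#include "parquet/properties.h"

// Open an encrypted file with a single decryption key and read two of its
// row groups as Arrow record batches.
void ReadTwoRowGroups() {
  std::shared_ptr<parquet::FileDecryptionProperties> decryption_props =
      parquet::FileDecryptionProperties::Builder()
          .footer_key("0123456789012345")  // placeholder 128-bit key
          ->build();

  parquet::ReaderProperties reader_props;
  reader_props.file_decryption_properties(decryption_props);
  reader_props.enable_buffered_stream();  // as in the baseline configuration

  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(false);  // as in the baseline configuration

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.OpenFile("/path/to/encrypted.parquet",
                                        /*memory_map=*/false, reader_props));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.Build(&reader));

  // Stream two of the row groups as record batches.
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  PARQUET_THROW_NOT_OK(reader->GetRecordBatchReader({0, 1}, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // ... consume batch ...
  }
}
```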
   
   I've profiled memory allocations with massif, setting `ARROW_DEFAULT_MEMORY_POOL=system` because massif doesn't seem to detect allocations made through mimalloc.
   
   Here are the baseline results with pre-buffering disabled and a buffered reader, showing a peak memory usage of 1.0 GiB:
   
   
![Image](https://github.com/user-attachments/assets/230c881f-cbb0-4602-bc73-d7a0cfbba991)
   
   The 453 MiB corresponds to one decompression buffer per column, each using 1 MiB (the page size). The 451 MiB corresponds to one decryption buffer per column; the total is slightly less than 453 MiB because these buffers only need to hold compressed pages. The float data was randomly generated so doesn't compress well, but the id and date columns are very regular.
   
   After changing to use temporary decryption buffers, the peak memory usage 
drops to 580 MiB:
   
   
![Image](https://github.com/user-attachments/assets/7fe398e8-094f-4764-9c4e-6a5f8cc7270d)
   
   I assume the decryption buffers are held onto by the column readers to avoid the cost of reallocating them for each page, but reallocation should hopefully be cheap when using an allocator other than the system allocator. I'll follow up and run some benchmarks to verify this.
   
   For reference, I'm testing on the main branch at commit 
`ed13cedd8bf7ddc06db152f97e68d86c2c37e949`.
   
   ### Component(s)
   
   C++, Parquet

