adamreeve opened a new issue, #46971:
URL: https://github.com/apache/arrow/issues/46971

   ### Describe the enhancement requested
   
   I've been looking into memory usage when reading large (wide) Parquet files 
with the Arrow based API. Two configuration options that greatly reduce memory 
use are enabling a buffered stream and disabling pre-buffering (see 
https://github.com/apache/arrow/issues/46935).
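
   For context, here is a minimal sketch of how those two options are set via the C++ reader properties (the helper function names are just for illustration):

```cpp
#include "parquet/properties.h"

// Low-level reader properties: read data pages through a buffered stream
// rather than materializing whole column chunks in memory at once.
parquet::ReaderProperties MakeReaderProperties() {
  parquet::ReaderProperties props;
  props.enable_buffered_stream();
  return props;
}

// Arrow-level reader properties: disable pre-buffering of column chunk byte ranges.
parquet::ArrowReaderProperties MakeArrowReaderProperties() {
  parquet::ArrowReaderProperties props;
  props.set_pre_buffer(false);
  return props;
}
```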
   
   When reading encrypted Parquet files, the [decryption buffers](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/parquet/column_reader.cc#L308) are another significant contributor to memory use.
 If data pages are compressed as well as encrypted, data is first decrypted 
into this buffer and then decompressed. The decrypted data is no longer needed 
after decompression, but the memory is held by each `SerializedPageReader` (one 
per encrypted column).
   
   I've tested changing these decryption buffers to be temporary, so that their 
memory can be returned to the memory pool after page decompression. This 
significantly reduces memory usage in my test case.
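
   As an illustration of the change (a simplified sketch only, not the actual `SerializedPageReader` code; `DecryptAndDecompressPage` and its parameters are hypothetical), the decryption buffer can be allocated locally so it is handed back to the pool once the page has been decompressed:

```cpp
#include <memory>

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Simplified sketch: decrypt a page into a temporary buffer, decompress it,
// and let the temporary buffer be released back to the pool when it goes out
// of scope, instead of holding it as a long-lived member of the page reader.
arrow::Result<std::shared_ptr<arrow::Buffer>> DecryptAndDecompressPage(
    const arrow::Buffer& encrypted_page, int64_t decrypted_size,
    int64_t uncompressed_size, arrow::MemoryPool* pool) {
  // Temporary decryption buffer, scoped to this call only.
  ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::ResizableBuffer> decryption_buffer,
                        arrow::AllocateResizableBuffer(decrypted_size, pool));

  // ... decrypt encrypted_page into decryption_buffer (elided) ...

  // Decompress from the temporary buffer into a separate output buffer.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> decompressed,
                        arrow::AllocateBuffer(uncompressed_size, pool));
  // ... decompress decryption_buffer into decompressed (elided) ...

  // decryption_buffer goes out of scope here, so its memory is returned to
  // the pool rather than being held until the next page is read.
  return decompressed;
}
```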
   
   My test case uses the following (a rough sketch of the read code follows this list):
   * 12 GB parquet file with 453 columns (451 are float32, one is an int32 date 
and one is an int32 id)
   * All columns encrypted with one key
   * 6 row groups, with each row group except the last having 1,048,576 rows
   * Read two of the row groups as Arrow record batches
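
   Here is that sketch, assuming the Arrow C++ `parquet::arrow::FileReader` API; the decryption key, file path and row group indices are placeholders, and `ReadTwoRowGroups` is just an illustrative name:

```cpp
#include <memory>
#include <vector>

#include "arrow/record_batch.h"
#include "parquet/arrow/reader.h"
#include "parquet/encryption/encryption.h"
#include "parquet/exception.h"
#include "parquet/properties.h"

// Open an encrypted file with a single decryption key and read two of its
// row groups as Arrow record batches.
void ReadTwoRowGroups() {
  std::shared_ptr<parquet::FileDecryptionProperties> decryption_props =
      parquet::FileDecryptionProperties::Builder()
          .footer_key("0123456789012345")  // placeholder 128-bit key
          ->build();

  parquet::ReaderProperties reader_props;
  reader_props.file_decryption_properties(decryption_props);
  reader_props.enable_buffered_stream();  // as in the baseline configuration

  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(false);  // as in the baseline configuration

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.OpenFile("/path/to/encrypted.parquet",
                                        /*memory_map=*/false, reader_props));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.Build(&reader));

  // Stream two of the row groups as record batches.
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  PARQUET_THROW_NOT_OK(reader->GetRecordBatchReader({0, 1}, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // ... consume batch ...
  }
}
```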
   
   I've profiled memory allocations with massif, setting `ARROW_DEFAULT_MEMORY_POOL=system` because massif doesn't seem to detect allocations made through mimalloc.
   
   Here are the baseline results with pre-buffering disabled and a buffered reader, showing a peak memory usage of 1.0 GiB:
   
   
![Image](https://github.com/user-attachments/assets/230c881f-cbb0-4602-bc73-d7a0cfbba991)
   
   The 453 MiB corresponds to one decompression buffer per column, each using 1 MiB (the page size). The 451 MiB corresponds to one decryption buffer per column; the total is slightly less than 453 MiB because these buffers only need to hold compressed pages. The float data was randomly generated so doesn't compress well, but the id and date columns are very regular.
   
   After changing to use temporary decryption buffers, the peak memory usage 
drops to 580 MiB:
   
   
![Image](https://github.com/user-attachments/assets/7fe398e8-094f-4764-9c4e-6a5f8cc7270d)
   
   I assume the decryption buffers are held onto by the column readers to avoid the cost of reallocating them for each page, but reallocation should hopefully be cheap when using an allocator other than the system allocator. I'll follow up and run some benchmarks to verify this.
   
   For reference, I'm testing on the main branch at commit 
`ed13cedd8bf7ddc06db152f97e68d86c2c37e949`.
   
   ### Component(s)
   
   C++, Parquet

