adamreeve opened a new issue, #46935: URL: https://github.com/apache/arrow/issues/46935
### Describe the bug, including details regarding any error messages, version, and platform.

When reading a Parquet file with `FileReader::GetRecordBatchReader` and default options, memory usage increases while iterating over batches, eventually reaching the size of all the data in the file. This may be expected behaviour, but it was quite surprising to me, so I want to open this issue to discuss whether it can be improved.

I have my test code in a branch on my fork: https://github.com/apache/arrow/compare/main...adamreeve:arrow:mem_use_test

Writing a test file with 100 row groups of 40 MB each (4 GB total):

<details>

```c++
TEST(TestStreamFile, WriteFile) {
  const std::string file_path = "/tmp/stream_test.parquet";
  constexpr int64_t num_row_groups = 100;
  constexpr int64_t rows_per_row_group = 1000000;
  constexpr int64_t num_columns = 10;

  PARQUET_ASSIGN_OR_THROW(
      const std::shared_ptr<::arrow::io::FileOutputStream> file,
      ::arrow::io::FileOutputStream::Open(file_path));

  WriterProperties::Builder writer_properties_builder;
  auto writer_properties = writer_properties_builder.build();

  std::vector<NodePtr> fields;
  for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
    fields.push_back(PrimitiveNode::Make("x" + std::to_string(col_idx),
                                         Repetition::REQUIRED, Type::FLOAT));
  }
  auto schema = std::dynamic_pointer_cast<schema::GroupNode>(
      schema::GroupNode::Make("root", Repetition::REQUIRED, fields));

  std::unique_ptr<ParquetFileWriter> writer =
      ParquetFileWriter::Open(file, schema, writer_properties, nullptr);

  std::vector<float> buffer(rows_per_row_group);
  for (auto row_group_idx = 0; row_group_idx < num_row_groups; ++row_group_idx) {
    auto row_group = writer->AppendRowGroup();
    for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
      ::arrow::random_real(rows_per_row_group, row_group_idx * num_columns + col_idx,
                           -1.0, 1.0, &buffer);
      auto column_writer = row_group->NextColumn();
      auto& float_column_writer = dynamic_cast<FloatWriter&>(*column_writer);
      float_column_writer.WriteBatch(rows_per_row_group, nullptr, nullptr,
                                     buffer.data());
    }
    row_group->Close();
  }
  writer->Close();
}
```

</details>

Reading the file:

<details>

```c++
TEST(TestStreamFile, ReadFile) {
  const std::string file_path = "/tmp/stream_test.parquet";

  PARQUET_ASSIGN_OR_THROW(
      std::shared_ptr<::arrow::io::ReadableFile> input_file,
      ::arrow::io::ReadableFile::Open(file_path, ::arrow::default_memory_pool()));

  ReaderProperties reader_properties;
  ArrowReaderProperties arrow_reader_properties;
  // arrow_reader_properties.set_pre_buffer(false);

  FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(input_file, reader_properties));
  builder.properties(arrow_reader_properties);

  int batchesRead = 0;
  int64_t maxRss = 0;
  {
    std::unique_ptr<FileReader> reader;
    PARQUET_THROW_NOT_OK(builder.Build(&reader));
    PARQUET_ASSIGN_OR_THROW(
        std::shared_ptr<::arrow::RecordBatchReader> batch_reader,
        reader->GetRecordBatchReader());

    while (true) {
      std::shared_ptr<::arrow::RecordBatch> batch;
      PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
      if (batch == nullptr) {
        break;
      }
      int64_t rss = ::arrow::internal::GetCurrentRSS();
      std::cout << "Batch " << batchesRead << ", RSS = "
                << (rss / (double)(1024 * 1024)) << " MB" << std::endl;
      maxRss = std::max(maxRss, rss);
      batchesRead++;
    }
  }
  std::cout << "Read " << batchesRead << " batches" << std::endl;
  std::cout << "Max RSS = " << (maxRss / (double)(1024 * 1024)) << " MB" << std::endl;
}
```

</details>

When running the read test, the output looks like:

```
Batch 0, RSS = 1151.82 MB
Batch 1, RSS = 1151.82 MB
...
Batch 318, RSS = 1151.82 MB
Batch 319, RSS = 1151.82 MB
Batch 320, RSS = 2175.82 MB
Batch 321, RSS = 2175.82 MB
...
Batch 669, RSS = 2175.82 MB
Batch 670, RSS = 2175.82 MB
Batch 671, RSS = 3199.82 MB
Batch 672, RSS = 3199.82 MB
...
Batch 1005, RSS = 3199.82 MB
Batch 1006, RSS = 3199.82 MB
Batch 1007, RSS = 4223.82 MB
Batch 1008, RSS = 4223.82 MB
...
Batch 1340, RSS = 4223.82 MB
Batch 1341, RSS = 4223.82 MB
Batch 1342, RSS = 5247.82 MB
Batch 1343, RSS = 5247.82 MB
...
Batch 1524, RSS = 5247.82 MB
Batch 1525, RSS = 5247.82 MB
Read 1526 batches
Max RSS = 5247.82 MB
```

From some experimenting, I found that disabling pre-buffering (uncommenting [this line](https://github.com/adamreeve/arrow/blob/bfd05037fce201ead24cadd7b1bb9cc7a09f56d7/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L876)) greatly reduces memory use:

```
Batch 0, RSS = 1079.81 MB
Batch 1, RSS = 1079.8 MB
...
Batch 1524, RSS = 1079.8 MB
Batch 1525, RSS = 1079.8 MB
Read 1526 batches
Max RSS = 1079.81 MB
```

This memory use still looked a bit high to me, but the max RSS reported by `/usr/bin/time -v` was a lot lower: about 94 MB without pre-buffering, compared to about 4.7 GB with pre-buffering.

From looking at the code, I can see there is a [cache of futures of buffers](https://github.com/apache/arrow/blob/f8cd17c0651e4886a08b2664ec8e0a0fff09eaa2/cpp/src/arrow/io/caching.cc#L155) in `ReadRangeCache::Impl`. Unless I'm missing something, it looks like once a buffer is stored in this cache it is never removed, which explains the memory usage behaviour. Should buffers be evicted from this cache once they've been read, to reduce memory usage?
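To make that suggestion a bit more concrete, below is a minimal sketch of the kind of take-and-erase behaviour I have in mind. It is purely illustrative: it uses only standard-library types rather than Arrow's `ReadRangeCache`, `Future` or `Buffer` classes, and the `EvictingRangeCache` name and its methods are hypothetical, not part of any Arrow API.

```c++
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

// A byte range within a file, used as the cache key.
struct ByteRange {
  int64_t offset;
  int64_t length;
  bool operator<(const ByteRange& other) const {
    return std::tie(offset, length) < std::tie(other.offset, other.length);
  }
};

// Illustrative cache that hands out each buffered range at most once:
// Take() moves the buffer out and erases the entry, so the memory held by
// the cache is bounded by the ranges that have not been consumed yet.
class EvictingRangeCache {
 public:
  // Store the data read for a range (std::string stands in for a buffer).
  void Insert(ByteRange range, std::string buffer) {
    entries_[range] = std::move(buffer);
  }

  // Return the cached buffer for a range and remove it from the cache,
  // or std::nullopt if the range was never cached (or already consumed).
  std::optional<std::string> Take(const ByteRange& range) {
    auto it = entries_.find(range);
    if (it == entries_.end()) {
      return std::nullopt;
    }
    std::string buffer = std::move(it->second);
    entries_.erase(it);
    return buffer;
  }

 private:
  std::map<ByteRange, std::string> entries_;
};
```

The real cache would presumably be harder to evict from than this sketch, since pre-buffering coalesces ranges and a single cached entry can serve several smaller reads, but the basic idea is that memory held by the cache would be bounded by the data not yet consumed rather than by everything read so far.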
### Component(s)

C++, Parquet