adamreeve opened a new issue, #46935: URL: https://github.com/apache/arrow/issues/46935
### Describe the bug, including details regarding any error messages, version, and platform.

When reading a Parquet file with `FileReader::GetRecordBatchReader` and default options, memory usage increases while iterating over batches, eventually reaching the size of all the data in the file. This may be expected behaviour, but it was quite surprising to me, so I want to open this issue to discuss whether it can be improved.

I have my test code in a branch on my fork: https://github.com/apache/arrow/compare/main...adamreeve:arrow:mem_use_test

Writing a test file with 100 row groups of 40 MB each (4 GB total):

<details>

```c++
TEST(TestStreamFile, WriteFile) {
  const std::string file_path = "/tmp/stream_test.parquet";
  constexpr int64_t num_row_groups = 100;
  constexpr int64_t rows_per_row_group = 1000000;
  constexpr int64_t num_columns = 10;

  PARQUET_ASSIGN_OR_THROW(
      const std::shared_ptr<::arrow::io::FileOutputStream> file,
      ::arrow::io::FileOutputStream::Open(file_path));

  WriterProperties::Builder writer_properties_builder;
  auto writer_properties = writer_properties_builder.build();

  std::vector<NodePtr> fields;
  for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
    fields.push_back(PrimitiveNode::Make("x" + std::to_string(col_idx),
                                         Repetition::REQUIRED, Type::FLOAT));
  }
  auto schema = std::dynamic_pointer_cast<schema::GroupNode>(
      schema::GroupNode::Make("root", Repetition::REQUIRED, fields));

  std::unique_ptr<ParquetFileWriter> writer =
      ParquetFileWriter::Open(file, schema, writer_properties, nullptr);

  std::vector<float> buffer(rows_per_row_group);
  for (auto row_group_idx = 0; row_group_idx < num_row_groups; ++row_group_idx) {
    auto row_group = writer->AppendRowGroup();
    for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
      ::arrow::random_real(rows_per_row_group, row_group_idx * num_columns + col_idx,
                           -1.0, 1.0, &buffer);
      auto column_writer = row_group->NextColumn();
      auto& float_column_writer = dynamic_cast<FloatWriter&>(*column_writer);
      float_column_writer.WriteBatch(rows_per_row_group, nullptr, nullptr,
                                     buffer.data());
    }
    row_group->Close();
  }
  writer->Close();
}
```

</details>

Reading the file:

<details>

```c++
TEST(TestStreamFile, ReadFile) {
  const std::string file_path = "/tmp/stream_test.parquet";

  PARQUET_ASSIGN_OR_THROW(
      std::shared_ptr<::arrow::io::ReadableFile> input_file,
      ::arrow::io::ReadableFile::Open(file_path, ::arrow::default_memory_pool()));

  ReaderProperties reader_properties;
  ArrowReaderProperties arrow_reader_properties;
  // arrow_reader_properties.set_pre_buffer(false);

  FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(input_file, reader_properties));
  builder.properties(arrow_reader_properties);

  int batchesRead = 0;
  int64_t maxRss = 0;
  {
    std::unique_ptr<FileReader> reader;
    PARQUET_THROW_NOT_OK(builder.Build(&reader));
    PARQUET_ASSIGN_OR_THROW(
        std::shared_ptr<::arrow::RecordBatchReader> batch_reader,
        reader->GetRecordBatchReader());

    while (true) {
      std::shared_ptr<::arrow::RecordBatch> batch;
      PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
      if (batch == nullptr) {
        break;
      }
      int64_t rss = ::arrow::internal::GetCurrentRSS();
      std::cout << "Batch " << batchesRead << ", RSS = "
                << (rss / (double)(1024 * 1024)) << " MB" << std::endl;
      maxRss = std::max(maxRss, rss);
      batchesRead++;
    }
  }
  std::cout << "Read " << batchesRead << " batches" << std::endl;
  std::cout << "Max RSS = " << (maxRss / (double)(1024 * 1024)) << " MB" << std::endl;
}
```

</details>

When running the read test, the output looks like:

```
Batch 0, RSS = 1151.82 MB
Batch 1, RSS = 1151.82 MB
...
Batch 318, RSS = 1151.82 MB
Batch 319, RSS = 1151.82 MB
Batch 320, RSS = 2175.82 MB
Batch 321, RSS = 2175.82 MB
...
Batch 669, RSS = 2175.82 MB
Batch 670, RSS = 2175.82 MB
Batch 671, RSS = 3199.82 MB
Batch 672, RSS = 3199.82 MB
...
Batch 1005, RSS = 3199.82 MB
Batch 1006, RSS = 3199.82 MB
Batch 1007, RSS = 4223.82 MB
Batch 1008, RSS = 4223.82 MB
...
Batch 1340, RSS = 4223.82 MB
Batch 1341, RSS = 4223.82 MB
Batch 1342, RSS = 5247.82 MB
Batch 1343, RSS = 5247.82 MB
...
Batch 1524, RSS = 5247.82 MB
Batch 1525, RSS = 5247.82 MB
Read 1526 batches
Max RSS = 5247.82 MB
```

From some experimenting, I found that disabling pre-buffering (uncommenting [this line](https://github.com/adamreeve/arrow/blob/bfd05037fce201ead24cadd7b1bb9cc7a09f56d7/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L876)) greatly reduces memory use:

```
Batch 0, RSS = 1079.81 MB
Batch 1, RSS = 1079.8 MB
...
Batch 1524, RSS = 1079.8 MB
Batch 1525, RSS = 1079.8 MB
Read 1526 batches
Max RSS = 1079.81 MB
```

This memory use still looked a bit high to me, but the max RSS reported by `/usr/bin/time -v` was a lot lower: about 94 MB without pre-buffering, compared to about 4.7 GB with pre-buffering.

From looking at the code, I can see there is a [cache of futures of buffers](https://github.com/apache/arrow/blob/f8cd17c0651e4886a08b2664ec8e0a0fff09eaa2/cpp/src/arrow/io/caching.cc#L155) in `ReadRangeCache::Impl`. Unless I'm missing something, it looks like once a buffer is stored in this cache it is never removed, which explains the memory usage behaviour. Should buffers be evicted from this cache once they've been read, to reduce memory usage?
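To make that suggestion a bit more concrete, below is a minimal sketch of the kind of take-and-erase behaviour I have in mind. It is purely illustrative: it uses only standard-library types rather than Arrow's `ReadRangeCache`, `Future` or `Buffer` classes, and the `EvictingRangeCache` name and its methods are hypothetical, not part of any Arrow API.

```c++
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

// A byte range within a file, used as the cache key.
struct ByteRange {
  int64_t offset;
  int64_t length;
  bool operator<(const ByteRange& other) const {
    return std::tie(offset, length) < std::tie(other.offset, other.length);
  }
};

// Illustrative cache that hands out each buffered range at most once:
// Take() moves the buffer out and erases the entry, so the memory held by
// the cache is bounded by the ranges that have not been consumed yet.
class EvictingRangeCache {
 public:
  // Store the data read for a range (std::string stands in for a buffer).
  void Insert(ByteRange range, std::string buffer) {
    entries_[range] = std::move(buffer);
  }

  // Return the cached buffer for a range and remove it from the cache,
  // or std::nullopt if the range was never cached (or already consumed).
  std::optional<std::string> Take(const ByteRange& range) {
    auto it = entries_.find(range);
    if (it == entries_.end()) {
      return std::nullopt;
    }
    std::string buffer = std::move(it->second);
    entries_.erase(it);
    return buffer;
  }

 private:
  std::map<ByteRange, std::string> entries_;
};
```

The real cache would presumably be harder to evict from than this sketch, since pre-buffering coalesces ranges and a single cached entry can serve several smaller reads, but the basic idea is that memory held by the cache would be bounded by the data not yet consumed rather than by everything read so far.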
### Component(s)

C++, Parquet