lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701
##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   No, on the contrary: there is only one instance of `Contents`, and hence a single cache shared between all reads of the same file right now. However, many instances of `RowGroupReader::Contents` get created (one per row group per column), so it's not easy to cache each row group separately.
   
   Perhaps I'm missing the point: what you'd like is a way to stream record batches out of a file, with the reader internally freeing each row group's memory once its data has been fully read, right? (And not pre-buffering more than some fixed number of row groups ahead of the current one.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
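The streaming pattern described in the comment (pre-buffer a fixed number of row groups ahead of the current one, and release each row group's buffers once it has been fully read) can be sketched with a toy bounded-window prefetcher. This is a hypothetical illustration only, not Arrow's `ReadRangeCache` or the Parquet reader's actual design; `RowGroupPrefetcher` and its `load`/`release` callbacks are invented names for the sketch.

```cpp
#include <algorithm>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch (not the Arrow implementation): keeps at most
// `window` row groups buffered ahead of the row group currently being
// read, and releases each row group's memory as soon as the reader
// moves past it.
class RowGroupPrefetcher {
 public:
  RowGroupPrefetcher(int num_row_groups, int window,
                     std::function<void(int)> load,
                     std::function<void(int)> release)
      : num_row_groups_(num_row_groups),
        window_(window),
        load_(std::move(load)),
        release_(std::move(release)) {
    // Warm the window with the first `window` row groups.
    while (next_to_load_ < num_row_groups_ &&
           static_cast<int>(buffered_.size()) < window_) {
      load_(next_to_load_);
      buffered_.push_back(next_to_load_++);
    }
  }

  // Returns the next row-group index to read, or -1 when exhausted.
  // The previously returned row group is released, and one more row
  // group is loaded to keep the look-ahead window full.
  int Next() {
    if (current_ >= 0) release_(current_);  // previous row group fully read
    if (buffered_.empty()) {
      current_ = -1;
      return -1;
    }
    current_ = buffered_.front();
    buffered_.pop_front();
    if (next_to_load_ < num_row_groups_) {
      load_(next_to_load_);
      buffered_.push_back(next_to_load_++);
    }
    return current_;
  }

 private:
  int num_row_groups_;
  int window_;
  int next_to_load_ = 0;
  int current_ = -1;  // row group currently being read, -1 if none
  std::deque<int> buffered_;
  std::function<void(int)> load_;
  std::function<void(int)> release_;
};
```

At any point, at most `window + 1` row groups are resident (the window ahead plus the one being read), which is the bounded-memory behavior the comment asks about.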