lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701



##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
       No, on the contrary: there's only one instance of `Contents`, and hence a
single cache right now, shared between all reads of the same file. However, lots
of instances of `RowGroupReader::Contents` get created (one per row group per
column), so it's not easy to cache each row group separately.
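For illustration, here's a minimal caller-side sketch of driving that single
shared cache. It assumes the public `parquet::ParquetFileReader::PreBuffer`
mirrors the `Contents` signature in the hunk above; the indices and the helper
name are made up.

```cpp
#include <memory>
#include <vector>

#include "arrow/io/caching.h"
#include "arrow/io/interfaces.h"
#include "parquet/file_reader.h"

// Hypothetical caller: pre-buffer everything we intend to read, once, up front.
void ReadWithPreBuffer(std::shared_ptr<::arrow::io::RandomAccessFile> source) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(source);

  // A single PreBuffer call populates the one ReadRangeCache owned by
  // Contents; subsequent reads of these row groups/columns hit that cache.
  std::vector<int> row_groups = {0, 1};  // hypothetical indices
  std::vector<int> columns = {0, 2};     // hypothetical indices
  reader->PreBuffer(row_groups, columns, ::arrow::io::CacheOptions::Defaults());

  // ... read the selected row groups/columns through the reader as usual ...
}
```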
   
   Perhaps I'm missing the point: what you'd like is a way to stream record
batches out of a file and have it internally clean up memory for each row group
once the data has been fully read, right? (And not pre-buffer more than some
fixed number of row groups ahead of the current one.)
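
A rough sketch of that streaming pattern, purely for illustration: it assumes
each `PreBuffer` call rebuilds `cached_source_` as in the hunk, so pre-buffering
the next row group drops the memory held for the previous one. The helper below
is hypothetical, not something this PR adds.

```cpp
#include <vector>

#include "arrow/io/caching.h"
#include "parquet/file_reader.h"

// Hypothetical caller: bound memory by pre-buffering one row group at a time.
void StreamRowGroups(parquet::ParquetFileReader* reader,
                     const std::vector<int>& columns) {
  const int num_row_groups = reader->metadata()->num_row_groups();
  for (int rg = 0; rg < num_row_groups; ++rg) {
    // Re-calling PreBuffer builds a fresh cache, so the buffers held for the
    // previous row group are released before the next one is fetched.
    reader->PreBuffer({rg}, columns, ::arrow::io::CacheOptions::Defaults());
    // ... read and emit record batches for row group `rg` here ...
  }
}
```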



