lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271



##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<arrow::io::internal::ReadRangeCache>(source_, 
options);

Review comment:
       GitHub acting up, sorry for the delay...
   
   > > Correct me if I'm wrong, though, but even that doesn't help much without 
more refactoring, since reading is organized along columns.
   > 
   > That was my original point, which you said was irrelevant...
   
   Apologies, I remember you had made this point on JIRA or the mailing list 
and was trying to find where (I still can't find where :/) I admit my 
understanding of the internals is poor, and your comment helped me understand 
what to look for better.
   
   > I'm afraid I don't understand you. How do you "read one row group at a 
time" if the cache is shared at the FileReader level? Do you mean creating a 
new Parquet file reader for each row group?
   
   Sorry, I was thinking about creating RowGroupReaders from a FileReader, one 
row group at a time: 
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi
   
   Each time this is called, it internally recreates the cache and buffers a 
new set of ranges. (Hence the comment above about "unsafe".) So this would let 
you read a row group's worth of data at a time, at the cost of more complexity 
for the user.
   
   > To be clear, I don't think this concern should block the PR, but the 
docstring should warn about it.
   
   I'll update it. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to