lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271
########## File path: cpp/src/parquet/file_reader.cc ########## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector<int>& row_groups, + const std::vector<int>& column_indices, + const ::arrow::io::CacheOptions& options) { + cached_source_ = + std::make_shared<arrow::io::internal::ReadRangeCache>(source_, options); Review comment: GitHub acting up, sorry for the delay... > > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns. > > That was my original point, which you said was irrelevant... Apologies, I remember you had made this point on JIRA or the mailing list and was trying to find where (I still can't find where :/) I admit my understanding of the internals is poor, and your comment helped me understand what to look for better. > I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group? Sorry, I was thinking about creating RowGroupReaders from a FileReader, one row group at a time: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi Each time this is called, it internally recreates the cache and buffers a new set of ranges. (Hence the comment above about "unsafe".) So this would let you read a row group's worth of data at a time, at the cost of more complexity for the user. > To be clear, I don't think this concern should block the PR, but the docstring should warn about it. I'll update it. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org