lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271
##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
file_metadata_ = std::move(metadata);
}
+ void PreBuffer(const std::vector<int>& row_groups,
+ const std::vector<int>& column_indices,
+ const ::arrow::io::CacheOptions& options) {
+ cached_source_ =
+ std::make_shared<arrow::io::internal::ReadRangeCache>(source_,
options);
Review comment:
GitHub acting up, sorry for the delay...
> > Correct me if I'm wrong, though, but even that doesn't help much without
more refactoring, since reading is organized along columns.
>
> That was my original point, which you said was irrelevant...
Apologies, I remember you had made this point on JIRA or the mailing list
and was trying to find where (I still can't find where :/) I admit my
understanding of the internals is poor, and your comment helped me understand
what to look for better.
> I'm afraid I don't understand you. How do you "read one row group at a
time" if the cache is shared at the FileReader level? Do you mean creating a
new Parquet file reader for each row group?
Sorry, I was thinking about creating RowGroupReaders from a FileReader, one
row group at a time:
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi
Each time this is called, it internally recreates the cache and buffers a
new set of ranges. (Hence the comment above about "unsafe".) So this would let
you read a row group's worth of data at a time, at the cost of more complexity
for the user.
> To be clear, I don't think this concern should block the PR, but the
docstring should warn about it.
I'll update it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]