bkietz commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r445663612
##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -571,6 +571,8 @@ static inline Result<std::string> FileFromRowGroup(
}
}
+ // TODO Is it possible to infer the file size and return a populated
FileInfo?
+ // This could avoid some spurious HEAD requests on S3 (ARROW-8950)
Review comment:
If it's not easily accessible from FileMetaData or so then this probably
warrants a custom metadata field when writing `_metadata`.
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -156,27 +177,26 @@ Result<std::shared_ptr<DatasetFactory>>
FileSystemDatasetFactory::Make(
ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfo(selector));
- std::vector<std::string> paths;
+ // XXX We may avoid copying around vectors of FileInfo by filtering in place.
+ std::vector<fs::FileInfo> filtered_files;
Review comment:
```suggestion
auto files_end = std::remove_if(files.begin(), files.end(), [&](const
FileInfo& info) {
if (!info.IsFile()) {
return true;
}
if (StartsWithAnyOf(info.path(), options.selector_ignore_prefixes)) {
return true;
}
return false;
});
files.erase(files_end, files.end());
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]