westonpace commented on a change in pull request #12625:
URL: https://github.com/apache/arrow/pull/12625#discussion_r841005135
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files,
std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+ // Discover files in directories and globs
+ std::vector<fs::FileInfo> discovered_files;
+ for (const auto& file : files) {
+ if (file.IsDirectory()) {
+ fs::FileSelector file_selector;
+ file_selector.base_dir = file.dir_name();
+ file_selector.recursive = true;
+ ARROW_ASSIGN_OR_RAISE(auto folder_files, filesystem->GetFileInfo(file_selector));
+ std::move(folder_files.begin(), folder_files.end(),
+ std::back_inserter(discovered_files));
+ } else if (file.IsGlob()) {
+ ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfoGlob(file));
+ std::move(files.begin(), files.end(), std::back_inserter(discovered_files));
+ } else if (file.IsFile()) {
+ discovered_files.emplace_back(file);
+ }
+ }
+
Review comment:
I'm not entirely sure what you are describing, but I think it sounds OK.
Can you show an example of what the fields would look like? For example, I was
thinking of something like:
```
struct ARROW_EXPORT FileSelector {
  std::string base_dir;
  bool allow_not_found;
  bool recursive;
  bool is_glob;
  int32_t max_recursion;
};
```
Are you thinking of using an enum instead of having both `bool is_glob` and
`bool recursive`? We could add an `is_file` property, but I don't think we
would use it. If we already know we have a file, we can just use the filename.
If all we have is a path, we don't know whether it is a file or not.
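
To make the enum question concrete, a minimal sketch might look like the following. `SelectorKind` and `FileSelectorSketch` are hypothetical names used only for illustration (nothing here is existing Arrow API), and the default values are assumptions:

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical: one enum instead of the separate `is_glob` / `recursive` flags,
// so contradictory combinations (e.g. glob + recursive) cannot be expressed.
enum class SelectorKind : std::uint8_t {
  kDirectory,           // list only the direct children of base_dir
  kRecursiveDirectory,  // also descend into subdirectories
  kGlob,                // interpret base_dir as a glob pattern
};

// Hypothetical stand-in for the struct discussed above; not Arrow's FileSelector.
struct FileSelectorSketch {
  std::string base_dir;
  bool allow_not_found = false;
  SelectorKind kind = SelectorKind::kDirectory;
  // Assumed default: effectively unlimited depth.
  std::int32_t max_recursion = std::numeric_limits<std::int32_t>::max();

  // Convenience accessors mirroring the two booleans they would replace.
  bool recursive() const { return kind == SelectorKind::kRecursiveDirectory; }
  bool is_glob() const { return kind == SelectorKind::kGlob; }
};
```

The trade-off is that an enum rules out invalid flag combinations, while two booleans would be a smaller change for existing callers that already set `recursive`.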