westonpace commented on a change in pull request #12625:
URL: https://github.com/apache/arrow/pull/12625#discussion_r840935751
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -18,6 +18,7 @@
#include "arrow/dataset/discovery.h"
#include <algorithm>
+#include <iostream>
Review comment:
```suggestion
```
##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -192,6 +193,8 @@ class ARROW_EXPORT FileSystem : public
std::enable_shared_from_this<FileSystem>
/// it exists.
/// If it doesn't exist, see `FileSelector::allow_not_found`.
virtual Result<FileInfoVector> GetFileInfo(const FileSelector& select) = 0;
+ // Same, for glob paths.
Review comment:
The comments here are a little too brief but I created
https://issues.apache.org/jira/browse/ARROW-16101 to address this more fully so
let's keep it how it is for now.
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -18,6 +18,7 @@
#include "arrow/dataset/discovery.h"
#include <algorithm>
+#include <iostream>
Review comment:
```suggestion
```
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -663,19 +664,26 @@ std::wstring PathWithoutTrailingSlash(const
PlatformFilename& fn) {
return path;
}
-Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename&
dir_path) {
+Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename&
dir_path,
+ bool is_glob = false) {
WIN32_FIND_DATAW find_data;
- std::wstring pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+ std::wstring pattern;
+ if (is_glob)
+ pattern = dir_path.ToNative();
+ else
+ pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+
+ std::vector<WIN32_FIND_DATAW> results;
HANDLE handle = FindFirstFileW(pattern.c_str(), &find_data);
if (handle == INVALID_HANDLE_VALUE) {
+ if (is_glob) return results;
Review comment:
Why is this ok? Can you add a comment explaining what is going on here?
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t
delimiter) {
+ std::vector<NativePathString> result_vec;
Review comment:
Minor nit: We try to avoid return variables named "result" and also try
to avoid appending type-suffixes like `_vec`. Maybe just
`std::vector<NativePathString> segments;`?
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t
delimiter) {
Review comment:
```suggestion
std::vector<NativePathString> SplitStr(const NativePathString& str, wchar_t
delimiter) {
```
##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -348,6 +350,10 @@ Result<std::vector<FileInfo>>
SubTreeFileSystem::GetFileInfo(const FileSelector&
return infos;
}
+Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfoGlob(const
FileInfo& file) {
+ return Status::NotImplemented("Glob file with SubTreeFileSystem");
Review comment:
Can we just prepend the glob pattern with `base_dir`?
##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -348,6 +350,10 @@ Result<std::vector<FileInfo>>
SubTreeFileSystem::GetFileInfo(const FileSelector&
return infos;
}
+Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfoGlob(const
FileInfo& file) {
+ return Status::NotImplemented("Glob file with SubTreeFileSystem");
Review comment:
Can we just prepend the glob pattern with `base_dir`?
##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -192,6 +193,8 @@ class ARROW_EXPORT FileSystem : public
std::enable_shared_from_this<FileSystem>
/// it exists.
/// If it doesn't exist, see `FileSelector::allow_not_found`.
virtual Result<FileInfoVector> GetFileInfo(const FileSelector& select) = 0;
+ // Same, for glob paths.
Review comment:
The comments here are a little too brief but I created
https://issues.apache.org/jira/browse/ARROW-16101 to address this more fully so
let's keep it how it is for now.
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t
delimiter) {
+ std::vector<NativePathString> result_vec;
Review comment:
Minor nit: We try to avoid return variables named "result" and also try
to avoid appending type-suffixes like `_vec`. Maybe just
`std::vector<NativePathString> segments;`?
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -759,6 +795,24 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+Result<std::vector<PlatformFilename>> ListGlob(const PlatformFilename&
glob_path) {
+ glob_t glob_result{};
+ errno = glob(glob_path.ToNative().c_str(), GLOB_TILDE, NULL, &glob_result);
+
+ if (errno != 0 && errno != GLOB_NOMATCH) {
+ return IOErrorFromErrno(errno, "Cannot list glob '", glob_path.ToString(),
"'");
+ }
+
+ std::vector<PlatformFilename> results;
+ for (size_t i = 0; i < glob_result.gl_pathc; ++i) {
+ results.emplace_back(std::string(glob_result.gl_pathv[i]));
Review comment:
```suggestion
results.emplace_back(glob_result.gl_pathv[i]);
```
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -759,6 +795,24 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+Result<std::vector<PlatformFilename>> ListGlob(const PlatformFilename&
glob_path) {
+ glob_t glob_result{};
+ errno = glob(glob_path.ToNative().c_str(), GLOB_TILDE, NULL, &glob_result);
+
+ if (errno != 0 && errno != GLOB_NOMATCH) {
+ return IOErrorFromErrno(errno, "Cannot list glob '", glob_path.ToString(),
"'");
+ }
+
+ std::vector<PlatformFilename> results;
+ for (size_t i = 0; i < glob_result.gl_pathc; ++i) {
+ results.emplace_back(std::string(glob_result.gl_pathv[i]));
Review comment:
```suggestion
results.emplace_back(glob_result.gl_pathv[i]);
```
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,26 @@ Result<std::shared_ptr<DatasetFactory>>
FileSystemDatasetFactory::Make(
Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
std::shared_ptr<fs::FileSystem> filesystem, const
std::vector<fs::FileInfo>& files,
std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+ // Discover files in directories and globs
+ std::vector<fs::FileInfo> discovered_files;
+ for (const auto& file : files) {
+ if (file.IsDirectory()) {
+ fs::FileSelector file_selector;
+ file_selector.base_dir = file.dir_name();
+ file_selector.recursive = true;
+ ARROW_ASSIGN_OR_RAISE(auto folder_files,
filesystem->GetFileInfo(file_selector));
+ std::move(folder_files.begin(), folder_files.end(),
+ std::back_inserter(discovered_files));
+ } else if (file.IsGlob()) {
+ ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfoGlob(file));
+ std::move(files.begin(), files.end(),
std::back_inserter(discovered_files));
+ } else if (file.IsFile()) {
+ discovered_files.emplace_back(file);
+ }
+ }
+
Review comment:
So this is something of an abstract concern but I think we have a bit of
ambiguity between input & output types.
I see `FileInfo` as an "output type". It is something you get back from a
filesystem that describes something that it went out and discovered or
inspected.
In the meantime, PlatformFilename and FileSelector are "input" types. They
are provided by the user for the purpose of obtaining FileInfo objects.
As a result I'm not sure it makes sense to have `FileType::Glob` because I
would interpret that as the filesystem discovering a glob inside the
filesystem. For example `GetFileInfo` would never return `FileType::Glob`.
Instead, can we do something slightly different:
Instead of modifying the `Make` overload that accepts `const
std::vector<fs::FileInfo>&` can we add a new overload that takes `const
std::vector<fs::FileSelector>&`?
Then, instead of adding `FileType::Glob` can we add an `is_glob` property to
`FileSelector` (if this is true then `base_dir` is expected to be a glob
pattern and `recursive` will be ignored).
We could also then change the rules a little on `FileSelector` so that
`base_dir` is allowed to represent a single file. Then we can treat the "path"
case (user doesn't specify it as a file or a directory) as a selector also.
That would allow us to move all the `GetFileInfo` calls out of the Substrait
consumer.
CC @pitrou for a second opinion as he has a lot of experience here.
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const
PlatformFilename& dir_path)
return results;
}
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t
delimiter) {
Review comment:
```suggestion
std::vector<NativePathString> SplitStr(const NativePathString& str, wchar_t
delimiter) {
```
##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -663,19 +664,26 @@ std::wstring PathWithoutTrailingSlash(const
PlatformFilename& fn) {
return path;
}
-Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename&
dir_path) {
+Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename&
dir_path,
+ bool is_glob = false) {
WIN32_FIND_DATAW find_data;
- std::wstring pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+ std::wstring pattern;
+ if (is_glob)
+ pattern = dir_path.ToNative();
+ else
+ pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+
+ std::vector<WIN32_FIND_DATAW> results;
HANDLE handle = FindFirstFileW(pattern.c_str(), &find_data);
if (handle == INVALID_HANDLE_VALUE) {
+ if (is_glob) return results;
Review comment:
Why is this ok? Can you add a comment explaining what is going on here?
##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,26 @@ Result<std::shared_ptr<DatasetFactory>>
FileSystemDatasetFactory::Make(
Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
std::shared_ptr<fs::FileSystem> filesystem, const
std::vector<fs::FileInfo>& files,
std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+ // Discover files in directories and globs
+ std::vector<fs::FileInfo> discovered_files;
+ for (const auto& file : files) {
+ if (file.IsDirectory()) {
+ fs::FileSelector file_selector;
+ file_selector.base_dir = file.dir_name();
+ file_selector.recursive = true;
+ ARROW_ASSIGN_OR_RAISE(auto folder_files,
filesystem->GetFileInfo(file_selector));
+ std::move(folder_files.begin(), folder_files.end(),
+ std::back_inserter(discovered_files));
+ } else if (file.IsGlob()) {
+ ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfoGlob(file));
+ std::move(files.begin(), files.end(),
std::back_inserter(discovered_files));
+ } else if (file.IsFile()) {
+ discovered_files.emplace_back(file);
+ }
+ }
+
Review comment:
So this is something of an abstract concern but I think we have a bit of
ambiguity between input & output types.
I see `FileInfo` as an "output type". It is something you get back from a
filesystem that describes something that it went out and discovered or
inspected.
In the meantime, PlatformFilename and FileSelector are "input" types. They
are provided by the user for the purpose of obtaining FileInfo objects.
As a result I'm not sure it makes sense to have `FileType::Glob` because I
would interpret that as the filesystem discovering a glob inside the
filesystem. For example `GetFileInfo` would never return `FileType::Glob`.
Instead, can we do something slightly different:
Instead of modifying the `Make` overload that accepts `const
std::vector<fs::FileInfo>&` can we add a new overload that takes `const
std::vector<fs::FileSelector>&`?
Then, instead of adding `FileType::Glob` can we add an `is_glob` property to
`FileSelector` (if this is true then `base_dir` is expected to be a glob
pattern and `recursive` will be ignored).
We could also then change the rules a little on `FileSelector` so that
`base_dir` is allowed to represent a single file. Then we can treat the "path"
case (user doesn't specify it as a file or a directory) as a selector also.
That would allow us to move all the `GetFileInfo` calls out of the Substrait
consumer.
CC @pitrou for a second opinion as he has a lot of experience here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]