[
https://issues.apache.org/jira/browse/ARROW-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francois Saint-Jacques updated ARROW-8884:
------------------------------------------
Description:
Listing files on S3 is slow due to the recursive nature of the algorithm.
The following change modifies the behavior of the S3Result to include all
objects but no "grouping" (directories). This dramatically lowers the number of
HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
if (!prefix.empty()) {
req.SetPrefix(ToAwsString(prefix) + kSep);
}
- req.SetDelimiter(Aws::String() + kSep);
+ // req.SetDelimiter(Aws::String() + kSep);
req.SetMaxKeys(kListObjectsMaxKeys);
while (true) {
{code}
The suggested change is to add an option to Selector, e.g.
`no_directory_result`, that enables this behavior.
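To illustrate why dropping the delimiter saves requests, here is a minimal sketch that simulates the two listing strategies against an in-memory key set. It does not use the AWS SDK or Arrow: `FakeBucket`, `ListRecursive`, `ListFlat`, `GetFiles` and this `Selector` struct are hypothetical names invented for the example, pagination is ignored, and `no_directory_result` is only the option name floated above, not an existing API.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bucket (NOT the AWS SDK).
// It only mimics how a delimiter groups keys into common prefixes.
struct FakeBucket {
  std::set<std::string> keys;
  int calls = 0;  // number of simulated list requests

  struct Result {
    std::vector<std::string> objects;          // keys returned as-is
    std::vector<std::string> common_prefixes;  // "directories" (delimiter only)
  };

  // One simulated request; pagination is ignored for brevity.
  Result List(const std::string& prefix, bool use_delimiter) {
    ++calls;
    Result out;
    std::set<std::string> seen;
    for (const auto& key : keys) {
      if (key.compare(0, prefix.size(), prefix) != 0) continue;
      const auto sep =
          use_delimiter ? key.find('/', prefix.size()) : std::string::npos;
      if (sep == std::string::npos) {
        out.objects.push_back(key);
      } else if (seen.insert(key.substr(0, sep + 1)).second) {
        out.common_prefixes.push_back(key.substr(0, sep + 1));
      }
    }
    return out;
  }
};

// With a delimiter: one request per "directory" level.
void ListRecursive(FakeBucket& bucket, const std::string& prefix,
                   std::vector<std::string>* files) {
  const auto res = bucket.List(prefix, /*use_delimiter=*/true);
  files->insert(files->end(), res.objects.begin(), res.objects.end());
  for (const auto& dir : res.common_prefixes) {
    ListRecursive(bucket, dir, files);
  }
}

// Without a delimiter: a single request returns every key under the prefix.
std::vector<std::string> ListFlat(FakeBucket& bucket,
                                  const std::string& prefix) {
  return bucket.List(prefix, /*use_delimiter=*/false).objects;
}

// Hypothetical selector carrying the option suggested in this issue.
struct Selector {
  std::string base_dir;
  bool no_directory_result = false;
};

std::vector<std::string> GetFiles(FakeBucket& bucket, const Selector& sel) {
  const std::string prefix = sel.base_dir.empty() ? "" : sel.base_dir + "/";
  if (sel.no_directory_result) return ListFlat(bucket, prefix);
  std::vector<std::string> files;
  ListRecursive(bucket, prefix, &files);
  return files;
}
```

In this simulation the recursive walk issues one request per directory level it visits, while the flat listing issues one request in total and returns the same object keys. A real bucket would still need one request per page of results in both cases.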
was:
Listing files on S3 is slow due to the recursive nature of the algorithm.
The following change modifies the behavior of the S3Result to include all
objects but no "grouping" (directories). This dramatically lowers the number of
HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
if (!prefix.empty()) {
req.SetPrefix(ToAwsString(prefix) + kSep);
}
- req.SetDelimiter(Aws::String() + kSep);
+ // req.SetDelimiter(Aws::String() + kSep);
req.SetMaxKeys(kListObjectsMaxKeys);
while (true) {
{code}
> [C++] Listing files with S3FileSystem is slow
> ---------------------------------------------
>
> Key: ARROW-8884
> URL: https://issues.apache.org/jira/browse/ARROW-8884
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
> Labels: filesystem
>
> Listing files on S3 is slow due to the recursive nature of the algorithm.
> The following change modifies the behavior of the S3Result to include all
> objects but no "grouping" (directories). This dramatically lowers the number
> of HTTP calls.
> {code:c++}
> diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
> index 70c87f46ec..98a40b17a2 100644
> --- a/cpp/src/arrow/filesystem/s3fs.cc
> +++ b/cpp/src/arrow/filesystem/s3fs.cc
> @@ -986,7 +986,7 @@ class S3FileSystem::Impl {
> if (!prefix.empty()) {
> req.SetPrefix(ToAwsString(prefix) + kSep);
> }
> - req.SetDelimiter(Aws::String() + kSep);
> + // req.SetDelimiter(Aws::String() + kSep);
> req.SetMaxKeys(kListObjectsMaxKeys);
>
> while (true) {
> {code}
> The suggested change is to add an option to Selector, e.g.
> `no_directory_result`, that enables this behavior.