westonpace opened a new pull request, #35440: URL: https://github.com/apache/arrow/pull/35440
### Rationale for this change

The old model of "walking" the directory tree could lead to a large number of calls. Someone fully listing a bucket had to make one S3 API call for every single directory in the bucket. With this approach there is only 1 call made for every 1000 files, regardless of how the files are spread across directories.

The only potential regression is when `max_recursion` is set to something > 1. For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

then a request for `bucket` with `max_recursion=2` will, under the new approach, list all 10,000 files and then eliminate the files that don't match. However, I believe these `max_recursion` cases are less common than the typical case of listing all files (which dataset discovery does).

### What changes are included in this PR?

The algorithm behind `GetFileInfo` and `DeleteDirContents` in `S3FileSystem` has changed.

### Are these changes tested?

Yes. There should be no behavior change, so all of the existing filesystem tests exercise this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
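To illustrate the idea (this is a hypothetical Python sketch, not Arrow's actual C++ implementation; the function name and return shape are invented for illustration): instead of issuing one list call per directory, a single flat listing of object keys is fetched, and both the directory entries and the `max_recursion` filtering are derived client-side from the key paths.

```python
def list_tree(keys, max_recursion):
    """Simulate deriving a file/directory tree from ONE flat listing of
    object keys (as a single paginated S3 list request would return),
    instead of issuing one list call per directory.

    Hypothetical helper for illustration only. Returns the files and the
    inferred directories whose depth, measured in path segments, does not
    exceed ``max_recursion``.
    """
    files, dirs = [], set()
    for key in keys:
        parts = key.split("/")
        # Every proper prefix of a key is an implied directory; keep the
        # ones that are shallow enough.
        for depth in range(1, len(parts)):
            if depth <= max_recursion:
                dirs.add("/".join(parts[:depth]))
        # Keep the file itself only if it is shallow enough.
        if len(parts) <= max_recursion:
            files.append(key)
    return sorted(files), sorted(dirs)
```

Note that the filtering happens *after* the full listing: with `max_recursion=2`, every key under a deep prefix is still fetched before being discarded, which is the potential regression described above.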
