westonpace opened a new pull request, #35440:
URL: https://github.com/apache/arrow/pull/35440

   ### Rationale for this change
   
   The old model of "walking" the directory tree could lead to a large number of calls: fully listing a bucket required one S3 API call for every single directory in the bucket.  With this approach, only one call is made for every 1000 files, regardless of how they are spread across directories.
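
   The call-count argument above can be sketched as follows. This is an illustrative model, not Arrow's actual C++ implementation; the helper names and the 500-directory example bucket are made up, and only the 1000-keys-per-call page size comes from S3's `ListObjectsV2` limit.

   ```python
   PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1000 keys per call

   def walk_calls(dirs_with_counts):
       """Old model: at least one list call per directory."""
       calls = 0
       for n_files in dirs_with_counts.values():
           calls += max(1, -(-n_files // PAGE_SIZE))  # ceiling division
       return calls

   def flat_calls(dirs_with_counts):
       """New model: one flat listing, one call per 1000 keys total."""
       total = sum(dirs_with_counts.values())
       return max(1, -(-total // PAGE_SIZE))

   # Hypothetical bucket: 500 directories with 10 files each (5000 files).
   bucket = {f"dir{i}": 10 for i in range(500)}
   print(walk_calls(bucket))  # 500 calls with the old walk
   print(flat_calls(bucket))  # 5 calls with the flat listing
   ```

   The flat listing wins whenever the bucket has many directories with fewer than 1000 files each, which is the common layout for partitioned datasets.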
   
   The only potential regression is when max_recursion is set to a value greater than 1.  For example, if a user had:
   
   ```
   bucket/foo/bar/<10000 files here>
   ```
   
   Then, for a request on `bucket` with `max_recursion=2`, the new approach will list all 10,000 files and only afterwards eliminate the files that don't match.
   
   However, I believe these cases (using max_recursion) are far less common than the typical case of listing all files (which is what dataset discovery does).
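
   A minimal sketch of the post-listing filter described above. The function name and the depth-counting convention (a key one level below the base is depth 1) are assumptions for illustration, not Arrow's exact `FileSelector` semantics:

   ```python
   def filter_by_depth(keys, max_recursion):
       """Drop keys nested deeper than max_recursion levels below the base.

       Depth convention assumed here: 'foo/a.txt' is depth 2,
       'foo/bar/b.txt' is depth 3, and so on.
       """
       return [k for k in keys
               if k.strip("/").count("/") + 1 <= max_recursion]

   # Flat listing returned everything; the deeper keys are then discarded.
   keys = ["foo/a.txt", "foo/bar/b.txt", "foo/bar/c.txt"]
   print(filter_by_depth(keys, 2))  # ['foo/a.txt']
   ```

   The work of listing the deep keys has already been paid for by the time the filter runs, which is exactly the regression discussed above.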
   
   ### What changes are included in this PR?
   
   The algorithm behind `GetFileInfo` and `DeleteDirContents` in `S3FileSystem` has changed.
   
   ### Are these changes tested?
   
   Yes.  There should be no behavior change, so all of the existing filesystem tests exercise this change.
   
   ### Are there any user-facing changes?
   
   No, other than (hopefully) better performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.