pitrou commented on code in PR #41822:
URL: https://github.com/apache/arrow/pull/41822#discussion_r1624508046
##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -2871,6 +2872,12 @@ Status S3FileSystem::CreateDir(const std::string& s,
bool recursive) {
for (const auto& part : path.key_parts) {
parent_key += part;
parent_key += kSep;
+ if (options().check_directory_existence_before_creation) {
Review Comment:
What I'm proposing is (example in the case of `a/b/c`):
1. check which ancestors exist by walking _up_ the directory tree: first
`a/b/c` then `a/b`... until you find the first existing ancestor
2. create the missing descendants by walking _down_ the directory tree: for
example, if you just found that `a` exists, create `a/b` then `a/b/c`
The idea is that, most of the time, almost the entire directory chain will
exist, especially if your workload is writing into a deeply partitioned
dataset. So doing the directory checks from leaf to root should issue fewer
requests and incur lower latency.
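
To make the two steps concrete, here is a minimal sketch against an in-memory stand-in for the bucket. `FakeBucket`, `AncestorKeys` and `CreateDirLeafToRoot` are hypothetical names used only for illustration; they are not Arrow or AWS SDK APIs, and the real code would issue `HeadObject` and `PutObject` requests where the sketch calls `Head` and `Put`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bucket; not part of Arrow.
struct FakeBucket {
  std::set<std::string> dirs;  // directory markers, e.g. "a/", "a/b/"
  int head_calls = 0;          // simulated HeadObject requests
  int put_calls = 0;           // simulated PutObject requests

  bool Head(const std::string& key) {
    ++head_calls;
    return dirs.count(key) > 0;
  }
  void Put(const std::string& key) {
    ++put_calls;
    dirs.insert(key);
  }
};

// Builds {"a/", "a/b/", "a/b/c/"} from {"a", "b", "c"}.
std::vector<std::string> AncestorKeys(const std::vector<std::string>& parts) {
  std::vector<std::string> keys;
  std::string key;
  for (const auto& part : parts) {
    key += part;
    key += '/';
    keys.push_back(key);
  }
  return keys;
}

void CreateDirLeafToRoot(FakeBucket& bucket,
                         const std::vector<std::string>& parts) {
  const std::vector<std::string> keys = AncestorKeys(parts);

  // Step 1: walk up from the leaf until an existing directory is found.
  // After the loop, keys[first_missing] is the shallowest missing directory.
  std::size_t first_missing = keys.size();
  while (first_missing > 0 && !bucket.Head(keys[first_missing - 1])) {
    --first_missing;
  }
  assert(first_missing <= keys.size());

  // Step 2: walk down, creating each missing directory in order.
  for (std::size_t i = first_missing; i < keys.size(); ++i) {
    bucket.Put(keys[i]);
  }
}
```

For `a/b/c` with only `a` present, this sketch issues three simulated `HeadObject` calls (`a/b/c`, `a/b`, `a`) and two `PutObject` calls (`a/b`, `a/b/c`).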
More formally, let's call _n_ the depth of the path given to `CreateDir`,
and _m_ the number of directories missing along that path.
* with the current approach, we're calling `HeadObject` _O(n)_ times and
`PutObject` _O(m)_ times;
* with my proposal, we're calling `HeadObject` _O(m)_ times and `PutObject`
_O(m)_ times.
Assuming that _m_ is on average much smaller than _n_ (_m_ would be 0 or 1
most of the time), the approach I'm proposing should be much more efficient,
and it would never be less efficient anyway.
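
As a rough worked count, the sketch below tallies requests for both strategies. It assumes the _m_ missing directories are the deepest ones on the path and that each existence check costs exactly one `HeadObject`; `RequestCount`, `RootToLeaf` and `LeafToRoot` are hypothetical helpers, not existing code.

```cpp
#include <algorithm>
#include <cassert>

// Simulated request totals for one CreateDir call.
struct RequestCount {
  int head;  // HeadObject requests
  int put;   // PutObject requests
};

// Root-to-leaf: one existence check per path component,
// plus one creation per missing directory.
RequestCount RootToLeaf(int n, int m) {
  assert(n >= 1 && m >= 0 && m <= n);
  return {n, m};
}

// Leaf-to-root: m misses followed by one hit. When every directory is
// missing (m == n) there is no hit, so the count is capped at n.
RequestCount LeafToRoot(int n, int m) {
  assert(n >= 1 && m >= 0 && m <= n);
  return {std::min(m + 1, n), m};
}
```

Under these assumptions, a path of depth 5 with one missing directory costs 5 `HeadObject` calls root-to-leaf versus 2 leaf-to-root. The two strategies only tie when every directory on the path is missing, or when exactly one already exists.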
Does that make sense or am I missing something?