martin-traverse commented on issue #38618:
URL: https://github.com/apache/arrow/issues/38618#issuecomment-1798062303
I have created a script to reproduce the problem. The script and its output
for both PyArrow 13 and PyArrow 14 are below and show the differences. It
appears to be an issue when the parent dir contains explicit sub dir objects.
In version 14, an empty dir is deleted OK and so is a dir containing blobs,
but a dir containing other dirs is not. It only seems to be an issue when the
sub dirs are created explicitly: if you put a blob in the dir with extra
slashes in the blob name, the dir and its contents are deleted fine. Also, if
you explicitly delete the problematic sub dirs first, you can then delete the
parent dir.
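
A minimal sketch of that workaround, assuming the sub dirs can be listed with a recursive `FileSelector`: delete every sub dir deepest-first, then the parent. `deepest_first` and `delete_dir_bottom_up` are hypothetical helper names, not part of PyArrow, and this has not been checked against every bucket layout.

```python
def deepest_first(dir_paths):
    """Order directory paths so that children always come before their parents."""
    return sorted(dir_paths, key=lambda p: p.strip("/").count("/"), reverse=True)


def delete_dir_bottom_up(filesystem, root):
    """Delete every sub dir under root explicitly, then root itself."""
    # Imported here so deepest_first() stays usable without PyArrow installed.
    import pyarrow.fs as pa_fs

    selector = pa_fs.FileSelector(root, recursive=True)
    sub_dirs = [
        info.path
        for info in filesystem.get_file_info(selector)
        if info.type == pa_fs.FileType.Directory
    ]
    # Deepest paths first, so each sub dir no longer contains other dirs
    # by the time it is deleted.
    for path in deepest_first(sub_dirs):
        filesystem.delete_dir(path)
    filesystem.delete_dir(root)
```

With the names used in the script below, this would be called as `delete_dir_bottom_up(s3fs, s3_bucket + "/" + test_dir)`.
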
I also had a look into the code. It does seem the S3 implementation is
listing dir contents and then deleting batches of objects in
DoDeleteDirContentsAsync(). So I guess either the listing part or the deleting
part isn't working right when the dir contains explicit sub dir objects.
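
To make the two stages concrete, here is a toy in-memory model of a list-then-batch-delete. It is not Arrow's C++ code. It assumes that an explicitly created dir is stored as a zero-length object whose key ends in "/", and that deletes are sent in batches of at most 1000 keys (the per-request limit of the S3 DeleteObjects call). `list_keys`, `batches` and `delete_dir_contents` are hypothetical names.

```python
def list_keys(store, prefix):
    """Return all keys under prefix, including dir marker keys ending in '/'."""
    return sorted(k for k in store if k.startswith(prefix))


def batches(keys, size=1000):
    """Split keys into chunks of at most `size` keys."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def delete_dir_contents(store, dir_key, batch_size=1000):
    """List everything under dir_key, then delete it batch by batch."""
    prefix = dir_key.rstrip("/") + "/"
    # Skip the marker of the dir itself; only its contents are removed here.
    keys = [k for k in list_keys(store, prefix) if k != prefix]
    for batch in batches(keys, batch_size):
        for key in batch:
            del store[key]
    return keys


# A dict stands in for the bucket (assumption: markers are empty objects).
store = {
    "test/": b"",                                # explicit dir marker
    "test/sub_dir/": b"",                        # explicit sub dir marker
    "test/sub_dir/some_blob.dat": b"Some data",
    "other/blob.dat": b"x",
}
deleted = delete_dir_contents(store, "test")
print(deleted)
print(sorted(store))
```

In this model, if either stage skipped keys ending in "/", the `test/sub_dir/` marker would survive the delete, which would be consistent with the symptom described above.
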
That's as far as I got before breakfast! Script and results are below. Is
this enough for you to find and fix the issue or do you still need the
bisection? Realistically I won't have time to set up all the tooling etc for
that until the weekend.
```
import pyarrow.fs as pa_fs
import uuid

s3_args = {
    "region": "<region>",
    "access_key": "<access_key_id>",
    "secret_key": "<secret_access_key>"
}

s3_bucket = "<bucket>"
s3fs = pa_fs.S3FileSystem(**s3_args)

# Case 1: empty dir
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Single dir deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

# Case 2: dir containing a blob (the blob is written inside test_dir)
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/some_blob.dat") as stream:
    stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir with content deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

# Case 3: dir containing an explicitly created sub dir
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir with sub dir deleted: [{dir_info.type == pa_fs.FileType.NotFound}]")

# Case 4: explicit sub dir deleted first, then the parent
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir + "/sub_dir")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted after subdir removed explicitly: [{dir_info.type == pa_fs.FileType.NotFound}]")

# Case 5: blob in an implicit sub dir (sub dir exists only in the blob name)
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/sub_dir/some_blob.dat") as stream:
    stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted with blob in implicit subdir: [{dir_info.type == pa_fs.FileType.NotFound}]")

# Case 6: blob in an explicitly created sub dir
test_dir = f"test_{uuid.uuid4()}"
s3fs.create_dir(s3_bucket + "/" + test_dir)
s3fs.create_dir(s3_bucket + "/" + test_dir + "/sub_dir/")
with s3fs.open_output_stream(s3_bucket + "/" + test_dir + "/sub_dir/some_blob.dat") as stream:
    stream.write(b"Some data")
s3fs.delete_dir(s3_bucket + "/" + test_dir)
dir_info = s3fs.get_file_info(s3_bucket + "/" + test_dir)
print(f"Dir deleted with blob in explicit subdir: [{dir_info.type == pa_fs.FileType.NotFound}]")
```
Results with PyArrow 13:
```
Single dir deleted: [True]
Dir with content deleted: [True]
Dir with sub dir deleted: [True]
Dir deleted after subdir removed explicitly: [True]
Dir deleted with blob in implicit subdir: [True]
Dir deleted with blob in explicit subdir: [True]
```
Results with PyArrow 14:
```
Single dir deleted: [True]
Dir with content deleted: [True]
Dir with sub dir deleted: [False]
Dir deleted after subdir removed explicitly: [True]
Dir deleted with blob in implicit subdir: [True]
Dir deleted with blob in explicit subdir: [False]
```