Hello.

Thank you for your feedback!!

> In which situation does this make a sizable difference in number of
> requests?

What I am addressing does not resolve the problem completely; there is
also the problem caused by *EnsureParentExists*, as described in [2]:

*The 42,129 requests with the type REST.PUT.OBJECT are due to the
implementation of move() and delete(): it currently attempts to
[re-create the parent directory](https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/filesystem/s3fs.cc#L2849)
after a copy or delete - this is because if there is only a single file
in the prefix and we move/delete it, then the prefix will no longer exist.
The workaround as implemented is to create a 0-sized object with the name
of the prefix, ensuring that it still "exists".*
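
To make the quoted behavior concrete, here is a minimal in-memory sketch. It is not Arrow code: `ToyBucket`, `dir_exists` and `delete_file` are hypothetical names made up for illustration, and the model only assumes that a "directory" exists while at least one key starts with its prefix.

```python
class ToyBucket:
    """In-memory stand-in for a flat object store (not the Arrow implementation)."""

    def __init__(self):
        self.objects = {}      # key -> size in bytes
        self.put_requests = 0  # number of PUT-like calls issued so far

    def put(self, key, size):
        self.objects[key] = size
        self.put_requests += 1

    def dir_exists(self, prefix):
        # A "directory" exists only if at least one key starts with the prefix.
        return any(k.startswith(prefix) for k in self.objects)

    def delete_file(self, key, ensure_parent=True):
        del self.objects[key]
        if ensure_parent and "/" in key:
            parent = key.rsplit("/", 1)[0] + "/"
            # Workaround from the quote: write a 0-byte marker named after the prefix.
            self.put(parent, 0)


bucket = ToyBucket()
bucket.put("prefix/01/file1.json.gz", 1438)

# Without the marker, deleting the only file makes the prefix disappear.
bucket.delete_file("prefix/01/file1.json.gz", ensure_parent=False)
vanished = not bucket.dir_exists("prefix/01/")

# With the marker, the prefix survives at the cost of one extra PUT.
bucket.put("prefix/01/file1.json.gz", 1438)
bucket.delete_file("prefix/01/file1.json.gz", ensure_parent=True)
kept = bucket.dir_exists("prefix/01/")
```

In this toy model every delete with `ensure_parent=True` issues one additional PUT, which is the pattern behind the REST.PUT.OBJECT count mentioned in the quote.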

...

*One major thing that worries me [is the EnsureParentExists()](https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/filesystem/s3fs.cc#L2521)
that is called from DeleteDir, DeleteFile and Move methods. In a
versioned bucket, this will repeatedly create empty keys to mimic a
directory.*

> Which "inherent consistency" are we talking about concretely?

I meant that users who treat S3 purely as an object store do not want
unnecessary 0-byte objects to be created in their buckets. For example:

    $ aws s3 ls s3://bucket/prefix/
    2023-06-23 11:51:02          0 prefix/01/
    2023-06-23 11:35:24       1438 prefix/01/file2.json.gz
    2023-06-23 10:47:18        819 prefix/01/file3.json.gz

> I don't know what this would achieve, and it would in itself issue
> additional "unnecessary requests".

You are right, and I agree: this goes beyond what a library should do.
If needed, users can combine the functions available through
S3FileSystem to optimize their own usage.

Thank you once again for your feedback. I will proceed with improving this
by adding an option to *S3Options*.
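
As a rough illustration of what such an option would change, here is a toy sketch. It assumes a hypothetical flag named `create_directory_markers`; the actual option name and semantics in *S3Options* are not decided in this thread, and `ToyOptions` and `delete_files` are invented for this example only.

```python
from dataclasses import dataclass


@dataclass
class ToyOptions:
    # Hypothetical flag; the real option in S3Options may be named differently.
    create_directory_markers: bool = True


def delete_files(keys, options):
    """Return the (request_type, key) pairs a toy client would issue."""
    requests = []
    for key in keys:
        requests.append(("DELETE", key))
        if options.create_directory_markers and "/" in key:
            parent = key.rsplit("/", 1)[0] + "/"
            # Re-create the parent prefix as a 0-byte marker after each delete.
            requests.append(("PUT", parent))
    return requests


keys = [f"prefix/01/file{i}.json.gz" for i in range(3)]
with_markers = delete_files(keys, ToyOptions())
without_markers = delete_files(keys, ToyOptions(create_directory_markers=False))
```

With the flag enabled, the toy client issues one PUT per delete; with it disabled, only the DELETE requests remain.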

Regards

Hyunseok Seo.


On Fri, Jul 12, 2024 at 7:52 PM, Antoine Pitrou <anto...@python.org> wrote:

>
> Hi,
>
> On 12/07/2024 at 12:21, Hyunseok Seo wrote:
> >
> > *### Why Maintain Empty Directory Markers?*
> >  From what I understand, object stores like S3 do not have a concept of
> > directories. The motivation behind maintaining these markers could be to
> > manage the object store as if it were a traditional file system.
>
> Also, to maintain compatibility with other filesystem-like abstractions
> over S3.
>
> > *### Issues with Marker Creation*
> > Users who have raised concerns about the creation of empty directory
> > markers cite the following reasons:
> >
> > - **Increase in Unnecessary Requests [2]**: Creating empty directory
> > markers leads to additional S3 requests, which can increase costs and
> > affect performance.
>
> In which situation does this make a sizable difference in number of
> requests?
>
> > - **File System Consistency Issues [1]**: S3 is designed as an object
> > store, and creating empty directory markers can break the inherent
> > consistency of the file system.
>
> Which "inherent consistency" are we talking about concretely?
>
> > *### Proposed Solutions*
> > Issue [1] suggests the following approaches:
> >
> > 1. **Add an Option**: Introduce an option in *S3Options* to control
> > whether empty directory markers are created, giving users the choice.
>
> That sounds ok to me.
>
> > 3. **Smarter Directory Creation**: Improve the implementation to check
> > for other objects in the same path before creating an empty directory
> > marker.
>
> I don't know what this would achieve, and it would in itself issue
> additional "unnecessary requests".
>
> Regards
>
> Antoine.
>
