Ah, okay.

Then I suppose an approach between 1 and 2 makes some sense to me: add an
option to disable creating the marker on object deletion/removal. I don't think
this alone is the best solution, but it at least adds a mode where marker
creation is more controlled.
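To make that mode concrete, here is a rough sketch. It is a toy in-memory model in Python, not Arrow code: the dict stands in for a bucket, and `list_dirs`, `delete_file`, and the `create_marker_on_delete` flag are names invented purely for illustration. It assumes the marker is a zero-byte object whose key ends in "/".

```python
# Toy stand-in for an S3 bucket: a dict mapping object keys to bytes.
# None of these names are Arrow API; the flag name is hypothetical.

def list_dirs(bucket):
    """Return the 'directories' implied by markers or by key prefixes."""
    dirs = set()
    for key in bucket:
        parts = key.rstrip("/").split("/")
        # A marker "foo/" counts as directory "foo";
        # an object "foo/bar" only implies its parent "foo".
        depth = len(parts) if key.endswith("/") else len(parts) - 1
        for i in range(1, depth + 1):
            dirs.add("/".join(parts[:i]))
    return dirs


def delete_file(bucket, key, create_marker_on_delete=True):
    """Delete an object, optionally leaving a placeholder for its parent."""
    del bucket[key]
    parent = key.rsplit("/", 1)[0] if "/" in key else ""
    if parent and create_marker_on_delete:
        # Zero-byte placeholder so the parent "directory" still lists.
        bucket.setdefault(parent + "/", b"")


bucket = {"foo/bar": b"data"}
delete_file(bucket, "foo/bar")
print(list_dirs(bucket))  # {'foo'}

bucket = {"foo/bar": b"data"}
delete_file(bucket, "foo/bar", create_marker_on_delete=False)
print(list_dirs(bucket))  # set()
```

With the flag off, the delete issues no extra write, but "foo" silently disappears once its last object is gone, which is the trade-off being discussed.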
As an aside, are there docs that describe the Arrow filesystem behaviors? They
would make it easier to join in on these conversations, and potentially easier
to clarify why a user may want to use an object store interface (assuming one
may be added sometime in the future). The closest I can find is [1], which is
most useful for very high-level users.

[1]: https://arrow.apache.org/docs/cpp/io.html#filesystems

On Tue, Jul 16, 2024 at 07:22, Antoine Pitrou <anto...@python.org> wrote: 
 Hello Aldrin,

It's not either/or: the directory marker is created every time it is necessary,
for example when CreateDir() is called.
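
As a rough illustration only (a toy in-memory model in Python, not the actual S3FileSystem code; `ToyClient` and its methods are invented names, and the zero-byte key ending in "/" is an assumed marker convention), the marker written by a create-dir call is what lets a second, independent client see the directory:

```python
class ToyClient:
    """One client of a shared bucket; keeps no directory state of its own."""

    def __init__(self, bucket):
        self.bucket = bucket  # shared dict standing in for the object store

    def create_dir(self, path):
        # The marker is a zero-byte object whose key ends with "/".
        self.bucket.setdefault(path.strip("/") + "/", b"")

    def dir_exists(self, path):
        prefix = path.strip("/") + "/"
        # True if the marker exists or any object lives under the prefix.
        return any(key.startswith(prefix) for key in self.bucket)


bucket = {}
writer, reader = ToyClient(bucket), ToyClient(bucket)
writer.create_dir("foo")
print(reader.dir_exists("foo"))  # True
print(reader.dir_exists("baz"))  # False
```

Because the directory's existence lives in the bucket rather than in either client, tracking prefixes locally would not give the reader the same answer.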

Regards

Antoine.


On 15/07/2024 at 19:20, Aldrin wrote:
> Thanks Antoine!
>
> Preserving the property across multiple clients (and presumably across
> independent sessions of the same client) is the part that I was missing.
>
> From the link you shared, I saw an AWS page discussing the use of folders
> in the S3 console [1]. Their approach is to create the marker on folder
> creation. Instead of adding an `S3Options` property for avoiding marker
> creation on delete, what if it changes the time of marker creation from "on
> delete" to "on creation"? That would seem to align better with tools like the
> S3 console and Cyberduck, and simplify the overall consensus logic that
> Felipe mentioned as being a potential pitfall for the 3rd proposed solution
> (folder creation should occur far less often than file deletion/move/replace).
>
> I'm not sure if this is already an option (I don't know much about the
> S3FileSystem implementation of Arrow) or was an old option that was changed in
> favor of creating the marker on deletion.
>
>
> [1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
>
>
>
>
>
> # ------------------------------
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> On Monday, July 15th, 2024 at 07:59, Antoine Pitrou <anto...@python.org> wrote:
>
>> No, because these markers also communicate the information to other
>> implementations of S3 abstractions.
>>
>
>> An example of this is: https://docs.cyberduck.io/protocols/s3/#folders
>>
>
>> Regards
>>
>
>> Antoine.
>>
>
>>
>
>> On 13/07/2024 at 07:15, Aldrin wrote:
>>
>
>>>> ...then I still expect the directory /foo to exist
>>>
>
>>> Right, but if that is the sole purpose of empty directory markers,
>>> I'm curious if there was an attempt at keeping track of the
>>> prefixes/directories locally?
>>>
>
>>> On Friday, July 12th, 2024 at 19:44, Hyunseok Seo hsseo0...@gmail.com wrote:
>>>
>
>>>> I wonder why S3 (object storage) operates based on file system semantics.
>>>> Python users are usually data scientists. They might not be familiar with
>>>> the differences between object storage and file storage. Furthermore, I
>>>> think there are a lot of pyarrow users.
>>>
>
>>>>> Avoiding file by file operations so that internal functions can batch as
>>>>> much as possible.
>>>
>
>>>> Thank you for the detailed explanation. So, are you suggesting that a more
>>>> fundamental solution is needed rather than just adding options? I thought
>>>> supporting such options would help users who do not want markers, despite
>>>> the issues you mentioned. Furthermore, I agree that supporting ObjectStore
>>>> is necessary for a more fundamental solution.
>>>
>
>>>> Thank you.
>>>
>
>>>> On Sat, Jul 13, 2024 at 10:00 AM, Weston Pace weston.p...@gmail.com wrote:
>>>
>
>>>>>> I think my question is still relevant: no matter what semantics
>>>>>> `S3FileSystem` is trying to provide, I'm still not sure how the placeholder
>>>>>> object helps. I assume it's for listing objects, but what else?
>>>
>
>>>>> If I have a local filesystem and I delete a file /foo/bar then I still
>>>>> expect the directory /foo to exist.
>>>
>
>>>>> ```
>>>>> mkdir /foo
>>>>> touch /foo/bar
>>>>> rm /foo/bar
>>>>> ls / # should show /foo
>>>>> ```
>>>
>
>>>>> In an object store there is no `mkdir` and, if I remove /foo/bar,
>>>>> there is no guarantee /foo will still exist.
>>>
>
>>>>> On Fri, Jul 12, 2024, 2:50 PM Aldrin octalene....@pm.me.invalid wrote:
>>>
>
>>>>>> But I think the issue being addressed [1] is essentially, "`delete_file`
>>>>>> shouldn't create additional files/directories in S3."
>>>
>
>>>>>> I think discussion about the semantics at large is interesting but may be
>>>>>> a digression? Also, I think there are varying degrees of "filesystem
>>>>>> semantics" that are even being discussed (the naming system and
>>>>>> hierarchical inode structure vs atomicity of read/write operations).
>>>
>
>>>>>> I think my question is still relevant: no matter what semantics
>>>>>> `S3FileSystem` is trying to provide, I'm still not sure how the
>>>>>> placeholder object helps. I assume it's for listing objects, but what else?
>>>
>
>>>>>> On Friday, July 12th, 2024 at 14:26, Raphael Taylor-Davies
>>>>>> r.taylordav...@googlemail.com.INVALID wrote:
>>>
>
>>>>>>>> Many people are familiar with object stores these days. You could
>>>>>>>> create a new abstraction `ObjectStore` which is very similar to
>>>>>>>> `FileSystem` except the semantics are object store semantics and not
>>>>>>>> filesystem semantics.
>>>
>
>>>>>>> FWIW in the Arrow Rust ecosystem we only provide an object store
>>>>>>> abstraction, and this has served us very well. My 2 cents is that
>>>>>>> object store semantics are sufficient, if not superior [1], than
>>>>>>> filesystem based interfaces for the vast majority of use cases, with
>>>>>>> the few workloads that aren't sufficiently served requiring such close
>>>>>>> integration with often OS-specific filesystem APIs and behaviours as to
>>>>>>> make building a coherent abstraction extremely difficult.
>>>
>
>>>>>>> Iceberg also took a similar approach with its File IO abstraction [2].
>>>
>
>>>>>>> [1]: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
>>>
>
>>>>>>> On 12/07/2024 22:05, Weston Pace wrote:
>>>
>
>>>>>>>>> The markers are necessary to offer file system semantics on top of
>>>>>>>>> object stores. You will get a ton of subtle bugs otherwise.
>>>>>>>>> Yes, object stores and filesystems are different. If you expect your
>>>>>>>>> filesystem to act like a filesystem then these things need to be
>>>>>>>>> done in order to avoid these bugs.
>>>
>
>>>>>>>> If an option modifies a filesystem to behave more like an object
>>>>>>>> store then I don't think it's necessarily a bad thing as long as it
>>>>>>>> isn't the default. By turning on the option the user is intentionally
>>>>>>>> altering the behavior and should not have the same expectations.
>>>
>
>>>>>>>> On the other hand, there is another approach you could take. Many
>>>>>>>> people are familiar with object stores these days. You could create a
>>>>>>>> new abstraction `ObjectStore` which is very similar to `FileSystem`
>>>>>>>> except the semantics are object store semantics and not filesystem
>>>>>>>> semantics. I believe most of our filesystem classes could implement
>>>>>>>> both `ObjectStore` and `FileSystem` abstractions without significant
>>>>>>>> code duplication.
>>>
>
>>>>>>>> This way, if a user wants filesystem semantics, they use a
>>>>>>>> `FileSystem` and they pay the abstraction cost. If a user is
>>>>>>>> comfortable with `ObjectStore` semantics they use `ObjectStore` and
>>>>>>>> they don't have to pay the costs.
>>>
>
>>>>>>>> This would be more work than just allowing options to violate
>>>>>>>> FileSystem guarantees but it would provide a clearer distinction
>>>>>>>> between the two.
>>>
>
>>>>>>>> On Fri, Jul 12, 2024 at 9:25 AM Aldrin octalene....@pm.me.invalid wrote:
>>>
>
>>>>>>>>> Hello!
>>>
>
>>>>>>>>> This may be naive, but why does the empty directory marker need to
>>>>>>>>> exist on the S3 side at all? If a local directory is created (because
>>>>>>>>> filesystem semantics), then I am not sure why a fake object needs to
>>>>>>>>> exist on the object-store side.
>>>
>
>>>>>>>>> On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho
>>>>>>>>> <felipe...@gmail.com> wrote:
>>>
>
>>>>>>>>>> Hi,
>>>
>
>>>>>>>>>> The markers are necessary to offer file system semantics on top of
>>>>>>>>>> object stores. You will get a ton of subtle bugs otherwise.
>>>
>
>>>>>>>>>> If instead of arrow::FileSystem, Arrow offered an arrow::ObjectStore
>>>>>>>>>> interface that wraps local filesystems and object stores with
>>>>>>>>>> object-store semantics (i.e. no concept of empty directory or atomic
>>>>>>>>>> directory deletion), then application developers would have more
>>>>>>>>>> control of the actions performed on the object store they are using.
>>>>>>>>>> Cons would be slower operations when working with a local filesystem
>>>>>>>>>> and no concept of directory.
>>>
>
>>>>>>>>>>> 1. Add an Option: Introduce an option in S3Options to control
>>>>>>>>>>> whether empty directory markers are created, giving users the choice.
>>>
>
>>>>>>>>>> Then it wouldn't be an honest implementation of arrow::FileSystem
>>>>>>>>>> for the reasons listed above.
>>>
>
>>>>>>>>>>> 2. Change Default Behavior: Modify the default behavior to avoid
>>>>>>>>>>> creating empty directory markers when a file is deleted.
>>>
>
>>>>>>>>>> That would bring in the bugs because an arrow::FileSystem instance
>>>>>>>>>> would behave differently depending on what is backing it.
>>>
>
>>>>>>>>>>> 3. Smarter Directory Creation: Improve the implementation to check
>>>>>>>>>>> for other objects in the same path before creating an empty
>>>>>>>>>>> directory marker.
>>>
>
>>>>>>>>>> This might be a problem when more than one client or thread is
>>>>>>>>>> mutating the object store through the arrow::FileSystem. You can
>>>>>>>>>> check now and, once you're done deleting, all the other files you
>>>>>>>>>> thought existed are deleted as well. Very likely if clients decide
>>>>>>>>>> to implement parallel deletion.
>>>
>
>>>>>>>>>> The existing solution of always creating a marker when done is not
>>>>>>>>>> perfect either, but less likely to break.
>>>
>
>>>>>>>>>> ## Suggested Workaround
>>>
>
>>>>>>>>>> Avoiding file by file operations so that internal functions can
>>>>>>>>>> batch as much as possible.
>>>
>
>>>>>>>>>> --
>>>>>>>>>> Felipe
>>>
>
>>>>>>>>>> On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo hsseo0...@gmail.com wrote:
>>>
>
>>>>>>>>>>> Hello, community!
>>>
>
>>>>>>>>>>> I am currently working on addressing the issue described in [C++]
>>>>>>>>>>> Add option to not create parent directory with S3 delete_file. In
>>>>>>>>>>> this process, I have found it necessary to gather feedback on how
>>>>>>>>>>> to best resolve this issue. Below is a summary and some questions
>>>>>>>>>>> I have for the community.
>>>
>
>>>>>>>>>>> ### Background
>>>>>>>>>>> Currently, the S3FileSystem generates an empty directory marker (by
>>>>>>>>>>> calling the EnsureParentExists function) when a file is deleted
>>>>>>>>>>> and the directory becomes empty. This behavior maintains the
>>>>>>>>>>> appearance of the directory structure. However, there have been
>>>>>>>>>>> issues raised by users regarding this behavior in issues [1].
>>>
>
>>>>>>>>>>> ### Why Maintain Empty Directory Markers?
>>>>>>>>>>> From what I understand, object stores like S3 do not have a concept
>>>>>>>>>>> of directories. The motivation behind maintaining these markers
>>>>>>>>>>> could be to manage the object store as if it were a traditional
>>>>>>>>>>> file system. If anyone knows the context behind the implementation
>>>>>>>>>>> of S3FileSystem, it would be great if you could share it.
>>>
>
>>>>>>>>>>> ### Issues with Marker Creation
>>>>>>>>>>> Users who have raised concerns about the creation of empty
>>>>>>>>>>> directory markers cite the following reasons:
>>>
>
>>>>>>>>>>> - Increase in Unnecessary Requests [2]: Creating empty directory
>>>>>>>>>>> markers leads to additional S3 requests, which can increase costs
>>>>>>>>>>> and affect performance.
>>>>>>>>>>> - File System Consistency Issues [1]: S3 is designed as an object
>>>>>>>>>>> store, and creating empty directory markers can break the inherent
>>>>>>>>>>> consistency of the file system.
>>>
>
>>>>>>>>>>> ### Proposed Solutions
>>>>>>>>>>> Issue [1] suggests the following approaches:
>>>
>
>>>>>>>>>>> 1. Add an Option: Introduce an option in S3Options to control
>>>>>>>>>>> whether empty directory markers are created, giving users the choice.
>>>>>>>>>>> 2. Change Default Behavior: Modify the default behavior to avoid
>>>>>>>>>>> creating empty directory markers when a file is deleted.
>>>>>>>>>>> 3. Smarter Directory Creation: Improve the implementation to check
>>>>>>>>>>> for other objects in the same path before creating an empty
>>>>>>>>>>> directory marker.
>>>>>>>>>>>
>>>>>>>>>>> Here is my personal thought (approach 1 + 3):
>>>
>
>>>>>>>>>>> (approach 1) I believe it would be best to add the Marker as an
>>>>>>>>>>> option (as some users might not want this enhancement).
>>>
>
>>>>>>>>>>> (approach 3) When the option is enabled, if there are no files
>>>>>>>>>>> (objects) in the path (prefix) corresponding to a directory based
>>>>>>>>>>> on the file system concept, we should maintain the Marker.
>>>>>>>>>>> Otherwise, we should check the number of files in the same path
>>>>>>>>>>> and avoid calling EnsureParentExists if there are two or more files.
>>>
>
>>>>>>>>>>> On the other hand, I also feel that this approach might make the
>>>>>>>>>>> logic more complicated.
>>>
>
>>>>>>>>>>> ### We Would Like Your Feedback
>>>>>>>>>>> - What are your thoughts on the creation of empty directory markers?
>>>>>>>>>>> - Which of the proposed solutions do you prefer?
>>>>>>>>>>> - Do you have any additional suggestions or comments?
>>>
>
>>>>>>>>>>> We appreciate your valuable feedback and aim to find the best
>>>>>>>>>>> solution based on your input.
>>>
>
>>>>>>>>>>> Thank you.
