[
https://issues.apache.org/jira/browse/HADOOP-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183473#comment-17183473
]
Steve Jacobs commented on HADOOP-13230:
---------------------------------------
Catching up on this as it's been a while.
Compatibility with external tools that add data to a bucket, such as awscli or
rclone, was the primary motivation for the original ticket. These tools don't
know anything about the empty-directory markers that S3A uses. The issue was
not the check upon deletion; it was checking for the marker on read without
also checking whether the directory was actually non-empty.
My suggestion was to replace the HEAD request used for the fakedir check (to
detect whether the directory is empty) with a listObjects call on the path
prefix, perhaps with a limit of 2 objects. If more than one object is returned,
or if exactly one object is returned and it is not the 'fakedir' object itself,
the directory is not empty. That would be the same number of API requests as
the HEAD check itself; a sketch of the probe follows.
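A minimal sketch of that probe, assuming the v1 AWS SDK for Java that
hadoop-aws bundles; the class and method names are illustrative, not S3A
internals:

    import java.util.List;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ListObjectsV2Request;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public final class DirProbe {
      /** dirKey is the directory key with its trailing slash, e.g. "table/". */
      static boolean isDirNonEmpty(AmazonS3 s3, String bucket, String dirKey) {
        ListObjectsV2Request req = new ListObjectsV2Request()
            .withBucketName(bucket)
            .withPrefix(dirKey)
            .withMaxKeys(2);            // two keys are enough to decide
        List<S3ObjectSummary> keys = s3.listObjectsV2(req).getObjectSummaries();
        if (keys.size() > 1) {
          return true;                  // marker plus at least one real object
        }
        // one key: non-empty only if it is a real object, not the marker itself
        return keys.size() == 1 && !keys.get(0).getKey().equals(dirKey);
      }
    }

Capping the listing at maxKeys=2 keeps the response small while still
distinguishing "marker only" from "marker plus data".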
The issue I encountered was while using Facebook's Presto query engine:
'fakedir' objects were not being deleted when inserting objects into an
'empty' partition in Hive (or so I thought). It turned out the issue wasn't in
Presto but in the third-party object storage system we were using. Link to my
issue with the Presto team here:
[https://github.com/prestodb/presto/issues/11076]
It turned out to be a bug in our on-premise system's handling of calls that
delete more than one object at a time: the bulk DELETE was never processed.
That has since been fixed. It affected s3a operations from Hive as well.
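For reference, the multi-object delete in question is the S3 bulk
DeleteObjects call. A minimal sketch of that call with the v1 Java SDK (the
bucket and key names are hypothetical):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest;
    import com.amazonaws.services.s3.model.DeleteObjectsResult;

    public class BulkDeleteDemo {
      public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // a single request removing several keys at once; this is the code
        // path the on-premise store mishandled
        DeleteObjectsRequest req = new DeleteObjectsRequest("bucket")
            .withKeys("path/part-0000", "path/part-0001", "path/");
        DeleteObjectsResult res = s3.deleteObjects(req);
        System.out.println("deleted: " + res.getDeletedObjects().size());
      }
    }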
> S3A to optionally retain directory markers
> ------------------------------------------
>
> Key: HADOOP-13230
> URL: https://issues.apache.org/jira/browse/HADOOP-13230
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Aaron Fabbri
> Assignee: Steve Loughran
> Priority: Major
> Fix For: 3.3.1
>
>
> Users of s3a may not realize that, in some cases, it does not interoperate
> well with other s3 tools, such as the AWS CLI. (See HIVE-13778, IMPALA-3558).
> Specifically, if a user:
> - Creates an empty directory with hadoop fs -mkdir s3a://bucket/path
> - Copies data into that directory via another tool, e.g. the aws cli.
> - Tries to access the data in that directory with any Hadoop software.
> Then the last step fails, because the fake empty-directory blob that s3a wrote
> in the first step causes s3a (listStatus() etc.) to continue to treat that
> directory as empty, even though the second step was supposed to populate the
> directory with data.
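> A minimal Java sketch of those three steps (bucket name hypothetical; step 2
> runs out of band through the aws cli):
>
>     import java.net.URI;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileStatus;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class MarkerRepro {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(
>             URI.create("s3a://bucket/"), new Configuration());
>         Path dir = new Path("s3a://bucket/path");
>         fs.mkdirs(dir);              // step 1: writes the fake dir marker
>         // step 2, out of band:  aws s3 cp data.csv s3://bucket/path/
>         FileStatus[] children = fs.listStatus(dir);   // step 3
>         // on affected versions children is empty: the lingering marker
>         // keeps s3a treating the directory as empty
>         System.out.println("children seen: " + children.length);
>       }
>     }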
> I wanted to document this fact for users. We may mark this as won't-fix, "by
> design". It may also be interesting to brainstorm solutions and/or a config
> option to change the behavior if folks care.