[ 
https://issues.apache.org/jira/browse/HADOOP-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886734#comment-15886734
 ] 

Joel Baranick edited comment on HADOOP-14124 at 2/27/17 11:27 PM:
------------------------------------------------------------------

Hey Steve,

Thanks for the info.  I read the Hadoop FileSystem specification, and this 
scenario appears to violate parts of it.

First, the postcondition of the specification for {{FSDataOutputStream 
create(Path, ...)}} states that the "... updated (valid) FileSystem must 
contains all the parent directories of the path, as created by 
mkdirs(parent(p)).".  I would contend that in this scenario, the opposite is 
happening.

Second, the "Empty (non-root) directory" postcondition of the specification for 
{{FileSystem.delete(Path P, boolean recursive)}} states that "Deleting an empty 
directory that is not root will remove the path from the FS and return true.".  
While that postcondition holds, I think it is incorrect to consider a fake 
directory empty when it contains another fake directory.  For example, on 
Debian, the following fails:
{noformat}
[~]# mkdir job
[~]# cd job
[job]# mkdir task
[job]# cd ..
[~]# rmdir job
rmdir: failed to remove ‘job’: Directory not empty
{noformat}
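The same POSIX semantics can be checked programmatically; here is a minimal, self-contained Python sketch of the rule (not Hadoop-specific):

```python
import errno
import os
import tempfile

# POSIX rmdir() refuses to remove a directory that still has entries,
# even when the only entry is itself an empty directory.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "job"))
os.mkdir(os.path.join(root, "job", "task"))

try:
    os.rmdir(os.path.join(root, "job"))  # job/ still contains task/
    removed = True
except OSError as e:
    removed = False
    # Linux reports ENOTEMPTY; some platforms report EEXIST instead.
    assert e.errno in (errno.ENOTEMPTY, errno.EEXIST)

assert removed is False
```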

Additionally, the interaction of AmazonS3Client/CyberDuck with empty 
directories seems different than you described.  See the following scenario:
# Open CyberDuck and connect to an S3 Bucket
# Create a folder called {{job}} in CyberDuck
# Right Click on the {{job}} folder and open +Info+. Result: _Size = 0B_ and 
_S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: 
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
# Navigate into the {{job}} folder in CyberDuck
# Create a folder called {{task}} in CyberDuck
# Right Click on the {{task}} folder and open +Info+.  Result: _Size = 0B_ and 
_S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: 
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Upload _file_ into _/job/task_ via CyberDuck
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: 
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
#* _job/task/file_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/task/"))}}.  Result: 
#* _s3a://bucket/job/task_ ^dir;empty^
#* _s3a://bucket/job/task/file_ ^file^
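The {{listObjects}} results above follow directly from S3's flat keyspace: the "folders" are just zero-byte keys ending in "/", and a prefix listing returns every key that starts with the prefix.  A toy sketch of that rule (no AWS SDK; the key set mirrors the scenario above):

```python
# Keys as they exist in the bucket after the CyberDuck steps above.
keys = ["job/", "job/task/", "job/task/file"]

def list_objects(prefix: str) -> list[str]:
    # Mimic AmazonS3Client.listObjects(bucket, prefix) with no delimiter:
    # S3 returns every key with the given prefix, including the zero-byte
    # "fake directory" marker objects themselves.
    return [k for k in keys if k.startswith(prefix)]

assert list_objects("job/") == ["job/", "job/task/", "job/task/file"]
assert list_objects("job/task/") == ["job/task/", "job/task/file"]
```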

One thing to note above is the inconsistent results from 
{{S3AFileSystem.listStatus(...)}}.  In some cases a folder is reported as 
empty and in others as non-empty, even though its contents have not changed.

At this point, if you delete {{/job/task/file}} in CyberDuck or the AWS 
Console, the {{/job}} and {{/job/task}} folders continue to exist and all 
calls continue to return the same results as before (except that 
{{/job/task/file}} is excluded from any list results).  If, on the other hand, 
you had created {{/job/task/file}} via S3AFileSystem, it would have implicitly 
removed the parent folders it considers "empty".  Then, when 
{{/job/task/file}} is deleted, the parent "empty" directories are also gone.
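That asymmetry can be modeled with a toy in-memory blobstore (a sketch of the observed behavior with hypothetical helper names, not the actual S3AFileSystem code):

```python
# Zero-byte keys ending in "/" are the "fake directory" markers.
store = {"job/": b"", "job/task/": b""}

def s3a_style_create(key: str, data: bytes) -> None:
    # Model of the behavior described above: writing an object silently
    # deletes every ancestor fake-directory marker
    # (cf. deleteUnnecessaryFakeDirectories in S3AFileSystem).
    store[key] = data
    parents = key.split("/")[:-1]
    for i in range(1, len(parents) + 1):
        store.pop("/".join(parents[:i]) + "/", None)

def delete_object(key: str) -> None:
    # Deleting the object does not recreate the markers.
    store.pop(key, None)

s3a_style_create("job/task/file", b"data")
assert "job/" not in store and "job/task/" not in store  # markers already gone
delete_object("job/task/file")
assert store == {}  # both "directories" vanished along with the file
```

An external client (CyberDuck, the AWS Console) would have left the two marker objects in place, which is exactly the divergence described above.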

My last counterpoint to the current Hadoop behavior with regard to S3A is the 
AWS S3 Console.  It effectively models a filesystem despite being backed by a 
blobstore.  I'm able to create nested folders, upload a file, delete the file, 
and the nested "empty" folders still exist.  As for the consistency 
guarantees, EMR solves those, making S3 behave even more like a true 
FileSystem.

Regarding HADOOP-9565, I don't have any need or desire for that.  I would 
prefer that everything continues to function under the FileSystem paradigm.  
The POSIX API is good for me, and consistency is a non-issue because of EMR, 
s3mper, or HADOOP-13345 (which seems to be based on the ideas from s3mper).

Thanks!



> S3AFileSystem silently deletes "fake" directories when writing a file.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-14124
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14124
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs, fs/s3
>    Affects Versions: 2.6.0
>            Reporter: Joel Baranick
>              Labels: filesystem, s3
>
> I realize that you guys probably have a good reason for {{S3AFileSystem}} to 
> cleanup "fake" folders when a file is written to S3.  That said, the fact 
> that it silently does this feels like a separation of concerns issue.  It 
> also leads to weird behavior issues where calls to 
> {{AmazonS3Client.getObjectMetadata}} for folders work before calling 
> {{S3AFileSystem.create}} but not after.  Also, there seems to be no mention 
> in the javadoc that the {{deleteUnnecessaryFakeDirectories}} method is 
> automatically invoked. Lastly, it seems like the goal of {{FileSystem}} 
> should be to ensure that code built on top of it is portable to different 
> implementations.  This behavior is an example of a case where this can break 
> down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
