[
https://issues.apache.org/jira/browse/HADOOP-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296193#comment-15296193
]
Steve Loughran commented on HADOOP-13164:
-----------------------------------------
The goal of the call is to eliminate upstream pseudo-directory blobs. I fear
removing it would do bad things.
But if it is called after every file is written, it will be expensive,
especially as there is a {{getFileStatus()}} in there (2 x {{getObjectMetadata()}} +
1 x {{listObjects()}}), plus the {{deleteObjects()}} call. As this goes up the
tree, the cost will be O(depth).
Given that a file has just been written, every parent directory is known to
have a child (i.e. it is non-empty), so you don't need to check so much. You
look for the fake directory object at each parent path, and if it is there,
delete it.
More deviously, you could say "delete the path without checking to see if it
exists". If it's not there, a failed delete is harmless. That'd still be
O(depth), but one S3 call per level, rather than 3 or 4.
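A minimal sketch of that blind delete, to make the call count concrete. The {{ObjectStore}} interface, the class name and both method names are hypothetical stand-ins for illustration; they are not the S3A or AWS SDK API, and the trailing-slash marker keys are an assumption about how the fake directories are named.

```java
import java.util.ArrayList;
import java.util.List;

class BlindParentDelete {
    /** Hypothetical stand-in for an S3 client; NOT the AWS SDK interface. */
    interface ObjectStore {
        void deleteObject(String key);
    }

    /**
     * Lists the assumed fake-directory marker key ("a/b/") of every
     * ancestor of the given object key, deepest first.
     */
    static List<String> parentMarkerKeys(String key) {
        List<String> markers = new ArrayList<>();
        int slash = key.lastIndexOf('/');
        while (slash > 0) {
            markers.add(key.substring(0, slash + 1));
            slash = key.lastIndexOf('/', slash - 1);
        }
        return markers;
    }

    /**
     * Issues one unconditional delete per ancestor, with no existence
     * check first. Returns the number of store calls made, which is
     * the depth of the path.
     */
    static int deleteParentMarkers(ObjectStore store, String fileKey) {
        int calls = 0;
        for (String marker : parentMarkerKeys(fileKey)) {
            store.deleteObject(marker);
            calls++;
        }
        return calls;
    }
}
```

For a file written at {{a/b/c/file.txt}} this makes three delete calls, one per parent, instead of three or four calls at each of those levels.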
And, once you go down that path, you could say "queue up a delete for all
parent paths and fire them off in one go", going from O(depth) to O(1).
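The queued version could look roughly like this. Again, {{BulkObjectStore}} and the method names are hypothetical, chosen only for this sketch, and the marker key layout is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class BatchedParentDelete {
    /** Hypothetical stand-in for a client with a multi-key delete call. */
    interface BulkObjectStore {
        void deleteObjects(List<String> keys);
    }

    /** Assumed marker keys ("a/b/") of every ancestor, deepest first. */
    static List<String> parentMarkerKeys(String key) {
        List<String> markers = new ArrayList<>();
        int slash = key.lastIndexOf('/');
        while (slash > 0) {
            markers.add(key.substring(0, slash + 1));
            slash = key.lastIndexOf('/', slash - 1);
        }
        return markers;
    }

    /**
     * Queues every parent marker and sends them in a single request.
     * Returns the number of store calls made: 1, or 0 when the file
     * sits at the root and has no parents to clean up.
     */
    static int deleteParentMarkers(BulkObjectStore store, String fileKey) {
        List<String> markers = parentMarkerKeys(fileKey);
        if (markers.isEmpty()) {
            return 0;
        }
        store.deleteObjects(markers);
        return 1;
    }
}
```

The S3 multi-object delete request accepts up to 1000 keys, so one request covers any realistic directory depth.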
Even better, you could maybe even do that asynchronously. I'd worry a bit there
about race conditions between the current thread and process, but given this is
just a cleanup, it might be safe, and I don't see it being any worse race-wise
than what exists today, except now it may be more visible to a single thread.
That would need very, very careful testing. The one thing nobody wants is for
an over-zealous delete operation to lose data.
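A rough sketch of the asynchronous variant, under the same caveats: {{BulkObjectStore}} and {{cleanupAsync}} are hypothetical names, and this does nothing about the race conditions mentioned above. It only shows the shape of handing the batched delete to another thread and treating a failure as non-fatal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

class AsyncMarkerCleanup {
    /** Hypothetical stand-in for a client with a multi-key delete call. */
    interface BulkObjectStore {
        void deleteObjects(List<String> keys);
    }

    /**
     * Submits the batched delete to the given executor and returns
     * immediately. A failed cleanup is reported on stderr and then
     * swallowed, so the returned future always completes normally.
     */
    static CompletableFuture<Void> cleanupAsync(
            BulkObjectStore store, List<String> markers, Executor executor) {
        // Copy the list so later changes by the caller cannot alter the request.
        List<String> snapshot = new ArrayList<>(markers);
        return CompletableFuture
                .runAsync(() -> store.deleteObjects(snapshot), executor)
                .exceptionally(failure -> {
                    System.err.println("marker cleanup failed: " + failure);
                    return null;
                });
    }
}
```

The caller can ignore the returned future on the write path, or {{join()}} it in a test to make the ordering deterministic.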
> Optimize S3AFileSystem::deleteUnnecessaryFakeDirectories
> --------------------------------------------------------
>
> Key: HADOOP-13164
> URL: https://issues.apache.org/jira/browse/HADOOP-13164
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Rajesh Balamohan
> Priority: Minor
>
> https://github.com/apache/hadoop/blob/27c4e90efce04e1b1302f668b5eb22412e00d033/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1224
> deleteUnnecessaryFakeDirectories is invoked in S3AFileSystem during rename
> and on outputstream close() to purge any fake directories. Depending on the
> nesting in the folder structure, it might take a lot longer, as it
> invokes getFileStatus multiple times. Instead, it should be able to break
> out of the loop once a non-empty directory is encountered.