[ 
https://issues.apache.org/jira/browse/HADOOP-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296193#comment-15296193
 ] 

Steve Loughran commented on HADOOP-13164:
-----------------------------------------

The goal of the call is to eliminate upstream pseudo-directory blobs. I fear 
removing it would do bad things.

But if it is called after every file is written, it will be expensive, 
especially as there is a {{getFileStatus()}} in there (2 x 
{{getObjectMetadata()}} + 1 x {{listObjects()}}), plus the {{deleteObjects()}} 
call. As this goes up the tree, the cost will be O(depth).

Given that a file has just been written, every parent directory is known to 
have a child (i.e. it is non-empty), so you don't need to check so much. You 
just look for the existence of each parent path, and if it is there: delete. 

More deviously, you could say "delete the path without checking to see if it 
exists". If it's not there, a failed delete is harmless. That'd still be 
O(depth), but one S3 call per level, rather than 3 or 4.
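
A minimal sketch of the unchecked-delete idea, in Java. The {{ObjectStore}} 
interface, {{CountingStore}} and {{BlindParentDelete}} are hypothetical 
stand-ins for illustration, not S3A or AWS SDK types; the sketch only assumes 
that fake directories are stored as keys ending in "/".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an object store client; not a real S3A or SDK type. */
interface ObjectStore {
  /** Deletes a key; deleting a key that does not exist is a harmless no-op. */
  void deleteObject(String key);
}

/** In-memory store that counts requests, for illustration only. */
class CountingStore implements ObjectStore {
  final Set<String> keys = new HashSet<>();
  int calls = 0;

  @Override
  public void deleteObject(String key) {
    calls++;
    keys.remove(key);
  }
}

class BlindParentDelete {
  /** Returns the fake-directory keys ("a/b/", "a/") above an object key, deepest first. */
  static List<String> parentMarkers(String fileKey) {
    List<String> markers = new ArrayList<>();
    int slash = fileKey.lastIndexOf('/');
    while (slash > 0) {
      markers.add(fileKey.substring(0, slash + 1));
      slash = fileKey.lastIndexOf('/', slash - 1);
    }
    return markers;
  }

  /** One unchecked delete per ancestor: still O(depth) requests, but no status probes. */
  static void deleteParentMarkers(ObjectStore store, String fileKey) {
    for (String marker : parentMarkers(fileKey)) {
      store.deleteObject(marker);
    }
  }
}
```

For a file written at {{a/b/c.txt}} this issues two deletes ({{a/b/}} and 
{{a/}}) and never touches the file itself.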

And, once you go down that path, you could say "queue up a delete for all 
parent paths and fire them off in one go", going from O(depth) requests to 
O(1). 
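
A rough sketch of the batched variant, in Java. {{BulkObjectStore}}, 
{{CountingBulkStore}} and {{BatchedParentDelete}} are hypothetical names used 
only for illustration; the sketch assumes the store offers some multi-key 
delete and that fake directories are keys ending in "/".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a client with a multi-key delete; not a real SDK type. */
interface BulkObjectStore {
  /** Deletes all listed keys in a single request; missing keys are ignored. */
  void deleteObjects(List<String> keys);
}

/** In-memory store that counts requests, for illustration only. */
class CountingBulkStore implements BulkObjectStore {
  final Set<String> keys = new HashSet<>();
  int calls = 0;

  @Override
  public void deleteObjects(List<String> toDelete) {
    calls++;
    keys.removeAll(toDelete);
  }
}

class BatchedParentDelete {
  /** Collects every ancestor marker key ("a/b/", "a/") for an object key. */
  static List<String> parentMarkers(String fileKey) {
    List<String> markers = new ArrayList<>();
    int slash = fileKey.lastIndexOf('/');
    while (slash > 0) {
      markers.add(fileKey.substring(0, slash + 1));
      slash = fileKey.lastIndexOf('/', slash - 1);
    }
    return markers;
  }

  /** Queues all ancestor markers and sends them as one request, whatever the depth. */
  static void deleteParentMarkers(BulkObjectStore store, String fileKey) {
    List<String> markers = parentMarkers(fileKey);
    if (!markers.isEmpty()) {
      store.deleteObjects(markers);
    }
  }
}
```

Real bulk-delete APIs cap the number of keys per request (S3 allows up to 
1000), so an extremely deep path would need the list split into chunks.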

Even better, you could maybe even do that asynchronously. I'd worry a bit there 
about race conditions between the current thread and process, but given this is 
just a cleanup, it might be safe, and I don't see it being any worse race-wise 
than what exists today, except now it may be more visible to a single thread.

That would need very, very careful testing. The one thing nobody wants is for 
an over-zealous delete operation to lose data.



> Optimize S3AFileSystem::deleteUnnecessaryFakeDirectories
> --------------------------------------------------------
>
>                 Key: HADOOP-13164
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13164
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>
> https://github.com/apache/hadoop/blob/27c4e90efce04e1b1302f668b5eb22412e00d033/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1224
> deleteUnnecessaryFakeDirectories is invoked in S3AFileSystem during rename 
> and on outputstream close() to purge any fake directories. Depending on the 
> nesting in the folder structure, it can take much longer, as it invokes 
> getFileStatus multiple times. Instead, it should be able to break out of the 
> loop once a non-empty directory is encountered. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
