[ https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350727#comment-16350727 ]

Steve Loughran commented on HADOOP-15191:
-----------------------------------------

Aaron, thanks for the comments.

w.r.t. directories vs. files: in a bulk S3 delete we can't check each path up 
front for being a directory, so if you start deleting paths which aren't there, 
or which refer to directories, things get confused. The patch as-is gets S3Guard 
into trouble if you hand it a directory in the list.

I'm currently thinking "do I need to do this at all?", based on those traces 
which show that the file list for distcp includes all files under deleted 
directory trees. If we eliminate that wasted effort, we may not need this new 
API at all.

Good: no changes to filesystems, a speedup everywhere.
Danger: I'd need to build up a data structure in the distcp copy committer, one 
which, if it goes OOM, breaks distcp workflows and leaves people who can phone 
me up unhappy.

I'm thinking of: a binary tree of the Path.hashCode() values of all deleted 
directories; you look up a file's parent dir before deleting the file, and for 
a directory you add its hash to the tree whether or not the delete is actually 
executed.

This avoids keeping all the Path structures around, needs only an object with 
a long and two pointers per entry, is O(lg(directories)) on lookup/insert, and 
we could make the directory check combine the lookup and the insert.
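
To make that concrete, here's a rough Java sketch of the idea; the class and 
method names are invented for illustration, this isn't code from any attached 
patch:

{code:java}
import java.util.TreeSet;

import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch only: tracks the hashes of directories already
 * scheduled for deletion so the committer can skip files under them.
 */
class DeletedDirectoryTracker {

  // Binary tree of Path.hashCode() values of deleted directories:
  // one node (a boxed long plus two child pointers) per directory,
  // O(lg(directories)) on lookup and insert.
  private final TreeSet<Long> deletedDirs = new TreeSet<>();

  /** Has this file's parent directory already been deleted? */
  boolean isParentDeleted(Path file) {
    Path parent = file.getParent();
    return parent != null && deletedDirs.contains((long) parent.hashCode());
  }

  /**
   * Record a directory as deleted, whether or not the delete call is
   * actually executed. The return value combines the lookup and the
   * insert: false means the directory was already tracked.
   */
  boolean recordDeletedDirectory(Path dir) {
    return deletedDirs.add((long) dir.hashCode());
  }
}
{code}

The committer would call recordDeletedDirectory() for every directory it meets 
on the (sorted) delete list and skip any file whose parent is already tracked.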

I'll file a separate JIRA for that; again, reviews appreciated. Let's see how 
far that one can get before worrying about bulk deletion, which will only 
benefit the case where directories are retained but some/many/all files are 
removed from them. The need for that feature will become more apparent if the 
next patch logs information about files vs. dirs deleted.


> Add Private/Unstable BulkDelete operations to supporting object stores for 
> DistCP
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-15191
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15191
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch, 
> HADOOP-15191-003.patch, HADOOP-15191-004.patch
>
>
> Large-scale DistCP with the -delete option doesn't finish in a viable time 
> because the final CopyCommitter deletes all missing files one by one. This 
> isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files were exposed as an API, DistCP would make 1/1000 of 
> the REST calls, and so not get throttled.
> Proposed: add an initially private/unstable interface for stores, 
> {{BulkDelete}}, which declares a page size and offers a 
> {{bulkDelete(List<Path>)}} operation for the bulk deletion.
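
For illustration, the interface could look something like the sketch below; 
the exact signatures, in particular the page-size accessor, are assumptions 
rather than what the attached patches define.

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a possible bulk delete interface for object stores;
 * illustrative only, not the contents of the attached patches.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public interface BulkDelete {

  /** Maximum number of paths which may be passed to a single bulkDelete() call. */
  int getBulkDeletePageSize();

  /**
   * Delete all the listed paths; the list must not exceed the page size.
   * How directories and partial failures are handled is still open.
   */
  void bulkDelete(List<Path> pathsToDelete) throws IOException;
}
{code}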


