[ 
https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344385#comment-16344385
 ] 

Steve Loughran edited comment on HADOOP-15191 at 1/30/18 2:00 AM:
------------------------------------------------------------------

h2. Proposed

* New interface {{org.apache.hadoop.fs.store.BulkIO}} (sketched below)
* S3A to implement this, relaying to {{S3ABulkOperations}}
* {{S3ABulkOperations}} to implement an optimised delete
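
A rough sketch of what that interface could look like. Only the {{bulkDelete(List<Path>)}} signature and the notion of a declared page size come from the proposal; the other method name and the javadoc here are my own illustration, not the final API.

{code}
// Sketch only: package and interface name as proposed above; the
// page-size accessor name is an illustrative assumption.
package org.apache.hadoop.fs.store;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;

public interface BulkIO {

  /**
   * Maximum number of paths a single bulkDelete() call will accept;
   * callers are expected to page their requests to this size.
   */
  int getBulkDeleteLimit();

  /**
   * Delete a list of files (not directories). There is no check that
   * the files exist: missing paths are not an error.
   * @param filesToDelete files to delete; at most getBulkDeleteLimit() entries
   * @return number of files deleted (return type is a guess)
   * @throws IOException on a failure
   */
  int bulkDelete(List<Path> filesToDelete) throws IOException;
}
{code}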

If you look at the cost of delete(file), it's not just the DELETE call; it's:

# getFileStatus(file) : HEAD, [HEAD], [LIST].
# DELETE
# getFileStatus(file.parent): HEAD, HEAD, LIST.
# if not found, PUT file.parent + "/"


FWIW, we could maybe optimise that second getFileStatus on the assumption that 
there's no file or dir marker there; all you need to do is check for the LIST 
call returning one or more entries.

Anyway, you are looking at ~7 HTTP requests per delete. 

Optimising that directory creation is equally important. Now, we could just 
have the bulk IO operation say "outcome of empty directories is undefined". I'm 
happy with that, but it's more of a change to the observable outcome of a 
distcp call.

New {{S3ABulkOperations.bulkDeleteFiles}}:

* No check for a file existing before the delete
* Issues a bulk delete with the configured page size (a paging sketch follows)
* Builds up a tree of parent paths, and only attempts to create fake 
directories for the parents at the bottom of the tree.
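
A minimal sketch of what the paging could look like against the S3 multi-object-delete call. The class and method names here are mine; the real code lives in {{S3ABulkOperations}} and reuses the S3A client plumbing.

{code}
// Illustrative sketch, not the patch: page a list of keys into
// multi-object DELETE requests of at most pageSize entries each.
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion;

class PagedDeleteSketch {

  private final AmazonS3 s3;
  private final String bucket;
  private final int pageSize;  // from fs.s3a.experimental.bulkdelete.pagesize

  PagedDeleteSketch(AmazonS3 s3, String bucket, int pageSize) {
    this.s3 = s3;
    this.bucket = bucket;
    this.pageSize = pageSize;
  }

  /** Delete the keys in pages: one DELETE request per page, not per key. */
  void deleteKeys(List<String> keys) {
    for (int start = 0; start < keys.size(); start += pageSize) {
      int end = Math.min(start + pageSize, keys.size());
      List<KeyVersion> page = new ArrayList<>();
      for (String key : keys.subList(start, end)) {
        page.add(new KeyVersion(key));
      }
      s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(page));
    }
  }
}
{code}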

To illustrate the directory-creation logic: if you delete the paths
{code}
/A/B.txt
/A/C/D.txt
/A/C/E.txt
{code}

then the only directory to consider creating is /A/C/; once that exists you 
know the parent path /A has an entry, so it doesn't need any work. The number 
of fake directory creations therefore goes from O(files) to O(leaves of the 
directory tree): at best Ω(1), at worst O(files).
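
A minimal sketch of that selection, assuming nothing about the patch's actual class or method names: from the set of deleted files, keep only those parent directories which have no other parent beneath them.

{code}
// Illustrative sketch of the "leaf parents only" selection; the real
// logic lives in S3ABulkOperations and will differ in detail.
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.hadoop.fs.Path;

final class LeafParentsSketch {

  /** Parents of the deleted files which have no other parent below them. */
  static List<Path> leafParents(List<Path> deletedFiles) {
    SortedSet<Path> parents = new TreeSet<>();
    for (Path file : deletedFiles) {
      parents.add(file.getParent());
    }
    List<Path> leaves = new ArrayList<>();
    for (Path candidate : parents) {
      boolean hasParentBelow = false;
      for (Path other : parents) {
        if (!other.equals(candidate) && isAncestorOf(candidate, other)) {
          hasParentBelow = true;   // candidate cannot be empty; skip it
          break;
        }
      }
      if (!hasParentBelow) {
        leaves.add(candidate);
      }
    }
    return leaves;   // for the example above: just /A/C
  }

  private static boolean isAncestorOf(Path ancestor, Path path) {
    for (Path p = path.getParent(); p != null; p = p.getParent()) {
      if (p.equals(ancestor)) {
        return true;
      }
    }
    return false;
  }
}
{code}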

One caveat: we now create an empty dir even if the source file doesn't exist.


h2. Testing

I've made the page size configurable ({{fs.s3a.experimental.bulkdelete.pagesize}}). 
We can switch on the paged delete mode with a very small page size, and so 
check that it works properly even with a small number of files.
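
As an illustration, a test can force a tiny page size through the normal configuration mechanism (the option name is the one above; the value and class name are arbitrary):

{code}
// Sketch: shrink the page size so the paged-delete path is exercised
// even when a test only creates a handful of files.
import org.apache.hadoop.conf.Configuration;

public class TinyPageSizeSetup {
  public static Configuration tinyPageConf() {
    Configuration conf = new Configuration();
    // far below the S3 multi-object-delete limit of 1000 keys per request
    conf.setInt("fs.s3a.experimental.bulkdelete.pagesize", 2);
    return conf;
  }
}
{code}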

A new unit test suite, {{TestS3ABulkOperations}}, primarily checks the tree 
logic of the directory-creation process.

A new integration test suite, {{ITestS3ABulkOperations}}, performs bulk IO and 
verifies the results.

The existing {{AbstractContractDistCpTest}} extends its 
{{deepDirectoryStructureToRemote}} test into 
{{deepDirectoryStructureToRemoteWithSync}}, which does an update with some 
files added and some removed, and asserts on the final state. This verifies 
that distcp is happy; I've also reviewed the logs to confirm that all is well 
there.

h2. Alternate Design: publish summary and do it independently

The other tactic would be not to integrate DistCp with the bulk delete at all, 
and instead have it publish its lists of input & output files for a follow-up 
reconciler.

Good: 

* No changes to DistCP delete process
* No need to add any explicit API/interface in hadoop-common

Bad:

* A new visible distcp option to save the output
* May lead to expectations of future maintenance of both the option and a 
persistent format for the data

You'd still need to add the bulk delete calls to the S3A FS, and to any 
other stores to which the bulk IO was added (Wasb could save on directory 
setup, by the look of things, as would oss: and swift:).





> Add Private/Unstable BulkDelete operations to supporting object stores for 
> DistCP
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-15191
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15191
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch
>
>
> Large scale DistCP with the -delete option doesn't finish in a viable time 
> because of the final CopyCommitter doing a 1 by 1 delete of all missing 
> files. This isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files was exposed as an API, distCP would do 1/1000 of 
> the REST calls, so not get throttled.
> Proposed: add an initially private/unstable interface for stores, 
> {{BulkDelete}} which declares a page size and offers a 
> {{bulkDelete(List<Path>)}} operation for the bulk deletion.


