[ 
https://issues.apache.org/jira/browse/HADOOP-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602403#comment-17602403
 ] 

Daniel Carl Jones commented on HADOOP-17881:
--------------------------------------------

At the moment, I think we're pulling the keys to be deleted from an iterator...
If we can start shuffling the keys across the delete requests, then we can
avoid 'rolling' over one shard/partition at a time and send fewer mutations
to any single shard/partition at once.

In a perfect world, we would know the full list of keys to be deleted, and we
could shuffle all the keys and distribute that heat over all the DeleteObjects
API calls.
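
A minimal sketch of that idea, in Python for brevity rather than as S3A code.
It assumes the full key listing fits in memory; `shuffled_pages`, `page_size`
and `seed` are hypothetical names for illustration only.

```python
import random
from typing import Iterable, Iterator, List, Optional


def shuffled_pages(keys: Iterable[str],
                   page_size: int = 1000,
                   seed: Optional[int] = None) -> Iterator[List[str]]:
    """Yield pages of at most page_size keys, drawn in random order.

    Shuffling before paging means each page mixes keys from across the
    whole listing instead of taking one contiguous, sorted run of keys.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    pool = list(keys)                   # requires the full listing up front
    random.Random(seed).shuffle(pool)   # seed only makes runs repeatable
    for start in range(0, len(pool), page_size):
        yield pool[start:start + page_size]
```

Each yielded page would then become the key list of one DeleteObjects call.
The cost is holding every key in memory before the first request is sent,
which is the trade-off against the current iterator-based approach.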

> S3A DeleteOperation to parallelize POSTing of bulk deletes
> ----------------------------------------------------------
>
>                 Key: HADOOP-17881
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17881
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Once the need to update the DDB tables is removed, we can go from a single
> POSTed delete at a time to posting a large set of bulk delete operations in
> parallel.
> The current design is to support incremental update of S3Guard tables, 
> including handling partial failures. Not a problem anymore.
> This will significantly improve delete() performance on directory trees with
> very many children/descendants, as it goes from a sequence of children/1000
> POSTs to parallel writes. As each file deleted is still throttled, we will be
> limited to 3500 deletes/second, so throwing a large pool of workers at the
> problem would be counter-productive and could cause problems for other
> applications trying to write under the same directory tree.
> But we can do better than one POST at a time.
> Proposed
> * if parallel delete is off: no limit
> * if parallel delete is on: limit the number of parallel POSTs to
> 3000/page-size, so you'll never have more updates pending than the write
> limit of a single shard.
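
The limit proposed in the quoted description can be sketched as below, again
in Python and only as an illustration. `max_parallel_posts`,
`delete_in_parallel` and the `post_page` callback are hypothetical names, and
the figure of 3000 is taken directly from the proposal.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

WRITE_BUDGET = 3000  # pending-update ceiling used in the proposal


def max_parallel_posts(page_size: int, write_budget: int = WRITE_BUDGET) -> int:
    """Number of bulk-delete POSTs allowed in flight: write_budget/page_size.

    Always at least 1, so a page larger than the budget still gets sent.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return max(1, write_budget // page_size)


def delete_in_parallel(pages: Sequence[List[str]],
                       post_page: Callable[[List[str]], T],
                       page_size: int) -> List[T]:
    """Send each page through post_page with a bounded worker pool.

    post_page stands in for whatever issues one bulk delete request;
    results come back in the same order as the pages.
    """
    workers = max_parallel_posts(page_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_page, pages))
```

With a page size of 1000 this allows 3 POSTs in flight, and with a page size
of 250 it allows 12, so in both cases at most 3000 keys are pending at once.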



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
