[
https://issues.apache.org/jira/browse/HADOOP-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886155#comment-16886155
]
Steve Loughran commented on HADOOP-16430:
-----------------------------------------
This is hard to do efficiently without the same tracking (or more) that the
rename call uses.
Specifically: currently a delete(path) call updates the parent dir. But if we
know the delete is taking place under a directory tree, then the only directory
whose parent listing needs updating is the base directory of the delete itself;
everything beneath it is being deleted anyway.
This leads to a design of:
# create ancestor state context with dest dir = base dir of delete
# add a delete(list<Path> paths, state) operation
and then for all entries in the path list
* If the entry is a file: put a tombstone
* If it is a directory (how to check?): also put a tombstone
* Only update the parent dir if the path being changed is that of the base
directory of the delete
* At the end of the delete, finish off by making sure there's a tombstone for
every directory and that no child entry is left without one.
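The steps above can be sketched roughly as follows. This is an illustrative toy, not the real S3A/S3Guard code: a Map stands in for the metastore, and all names (IncrementalDeleteSketch, deleteTree, the "TOMBSTONE" marker) are assumptions for the sake of the example.

```java
import java.util.*;

// Toy sketch of the incremental delete design: tombstone every entry,
// but only touch the parent listing for the base directory of the
// delete itself. A Map<String, String> stands in for the metastore.
public class IncrementalDeleteSketch {

    // entry states: "FILE", "DIR" or "TOMBSTONE"
    static final Map<String, String> store = new HashMap<>();

    static void deleteTree(String baseDir, List<String> paths) {
        for (String p : paths) {
            // file or directory: put a tombstone either way
            store.put(p, "TOMBSTONE");
            if (p.equals(baseDir)) {
                // only here do we update the parent dir's listing;
                // every other path is inside the tree being deleted
                store.put(parentOf(p), "DIR");
            }
        }
    }

    static String parentOf(String p) {
        int i = p.lastIndexOf('/');
        return i <= 0 ? "/" : p.substring(0, i);
    }

    public static void main(String[] args) {
        store.put("/base", "DIR");
        store.put("/base/a", "FILE");
        store.put("/base/sub", "DIR");
        store.put("/base/sub/b", "FILE");
        deleteTree("/base",
            Arrays.asList("/base/sub/b", "/base/a", "/base/sub", "/base"));
        System.out.println(store.get("/base"));      // TOMBSTONE
        System.out.println(store.get("/base/sub"));  // TOMBSTONE
    }
}
```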
Like I warned: more complex. Not much simpler than rename, either. In fact, the
operation is essentially that of MetadataStore.move() as used in the
ProgressiveRenameTracker.
> S3AFilesystem.delete to incrementally update s3guard with deletions
> -------------------------------------------------------------------
>
> Key: HADOOP-16430
> URL: https://issues.apache.org/jira/browse/HADOOP-16430
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Steve Loughran
> Priority: Major
>
> Currently S3AFilesystem.delete() only updates S3Guard at the end of a
> paged delete operation. This makes it slow when there are many thousands of
> files to delete, and increases the window of vulnerability to failures.
> Preferred:
> * after every bulk DELETE call is issued to S3, queue the (async) delete of
> all entries in that POST.
> * at the end of the delete, await the completion of these operations.
> * inside S3AFS, also do the delete across threads, so that different HTTPS
> connections can be used.
> This should maximise DDB throughput against tables which aren't IO limited.
> When executed against small, IO-limited tables, the parallel DDB DELETE
> batches will trigger a lot of throttling events; we should make sure these
> aren't going to trigger failures.
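The preferred flow quoted above (queue an async metastore update after each bulk DELETE page, then await them all) can be sketched like this. It is a minimal standalone sketch, not the S3A implementation: the page size, PagedDeleteSketch class, and the Set standing in for the metastore are all assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the per-page flow: (1) issue the bulk DELETE for a page,
// (2) queue an async metastore update for that page on a thread pool,
// (3) at the end of the whole delete, await every queued update.
public class PagedDeleteSketch {

    static final int PAGE_SIZE = 2;  // S3 bulk delete allows up to 1000 keys

    public static void main(String[] args) throws Exception {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> pending = new ArrayList<>();
        // stand-in for the metastore's record of deleted entries
        Set<String> deletedInStore = ConcurrentHashMap.newKeySet();

        for (int i = 0; i < keys.size(); i += PAGE_SIZE) {
            List<String> page =
                keys.subList(i, Math.min(i + PAGE_SIZE, keys.size()));
            // (1) bulk DELETE to S3 for this page would go here (omitted)
            // (2) queue the async metastore update for the same page
            pending.add(pool.submit(() -> deletedInStore.addAll(page)));
        }
        // (3) await completion of all queued metastore updates
        for (Future<?> f : pending) {
            f.get();
        }
        pool.shutdown();
        System.out.println(deletedInStore.size());  // prints 5
    }
}
```

Running the metastore updates on separate threads is what lets different HTTPS connections be used, but it is also what makes throttling on small tables a concern.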
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)