[
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370328#comment-16370328
]
Steve Loughran commented on HADOOP-15209:
-----------------------------------------
Patch 002
* LRU cache
* patch contains the HADOOP-15208 changes, as both were stamping on
CopyCommitter and both are parts of the same problem.
The aim of the delete tracker is to actually eliminate the scale problem;
HADOOP-15208 allows the changed files to be saved to a file. This will let us
understand better what kind of updates are taking place, and provide a starting
point if this patch doesn't work and we need to add some bulk delete operation.
Without that real data, we have to guess whether or not the tracker works, or,
more specifically, what size of cache is good.
There's a hard-coded limit of 1000 entries right now; this avoids having to
add, document and maintain another config point.
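To make the mechanism concrete, here is a minimal sketch of the approach,
built on a {{LinkedHashMap}} in access-order mode; the class and method names
are illustrative only, not the patch's actual DeletedDirTracker code.
{code}
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;

// Sketch only: an LRU set of recently deleted directories. The 1000-entry
// limit matches the hard-coded size described above; everything else is
// illustrative.
class LruDeletedDirSet {
  private static final int MAX_ENTRIES = 1000;

  // accessOrder=true means get() refreshes an entry's position, so
  // directories which keep getting hits stay in the cache.
  private final Map<Path, Boolean> cache =
      new LinkedHashMap<Path, Boolean>(MAX_ENTRIES, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Path, Boolean> eldest) {
          return size() > MAX_ENTRIES;
        }
      };

  /** Record that a directory (and so its whole subtree) was deleted. */
  void markDeleted(Path dir) {
    cache.put(dir, Boolean.TRUE);
  }

  /** True iff any ancestor of this path is known to be deleted. */
  boolean isUnderDeletedDir(Path path) {
    for (Path p = path.getParent(); p != null; p = p.getParent()) {
      if (cache.get(p) != null) {    // get() also refreshes LRU position
        return true;
      }
    }
    return false;
  }
}
{code}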
Outcome of a test run over a partition tree of 20 years, each with 12 months,
30 days and 24 files/day:
{code}
2018-02-20 17:09:20,407 [main] INFO mapred.TestDeletedDirTracker
(TestDeletedDirTracker.java:deletePaths(151)) - Delete YEAR=9
2018-02-20 17:09:20,420 [main] INFO mapred.TestDeletedDirTracker
(TestDeletedDirTracker.java:deletePaths(155)) - After proposing to delete
182973 paths, 21 directories and 0 files were explicitly deleted from a cache
DeletedDirTracker{maximum size=1000; current size=1000}
{code}
If I shrink the cache to 10 entries, you get ~250 rmdirs and no file deletes:
{code}
2018-02-20 17:09:21,539 [main] INFO mapred.TestDeletedDirTracker
(TestDeletedDirTracker.java:deletePaths(155)) - After proposing to delete
182973 paths, 252 directories and 0 files were explicitly deleted from a cache
DeletedDirTracker{maximum size=10; current size=10}
{code}
This implies the parent year/month dirs are being evicted by all the day
entries: when we get a cache hit on a low-level dir, the upper parent dirs
don't get marked as recent. That could be fixed at the expense of more cache
lookups: it would guarantee that while a deep/wide dir was being processed,
the fact that the whole tree was deleted is remembered (see the sketch below).
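As a sketch of that fix (again illustrative, not the patch): keep walking the
ancestor chain to the root even after a hit, so the access-order {{get()}}
touches the upper parents too.
{code}
// Variant: trade extra lookups for eviction resistance. Every cached
// ancestor on the chain is touched, so year/month entries stay as
// recent as the day entries beneath them.
boolean isUnderDeletedDir(Path path) {
  boolean found = false;
  for (Path p = path.getParent(); p != null; p = p.getParent()) {
    if (cache.get(p) != null) {   // get() refreshes the entry's position
      found = true;               // keep walking; don't return early
    }
  }
  return found;
}
{code}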
Alternatively: we don't add new entries for child dirs, and keep only the
top-level directories in the cache (I'd still like to remember the last-used
dir though, for the use case of "thousands of files under a dir").
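That variant might look like this (the single-entry "last used dir" field is
hypothetical):
{code}
// Variant: only cache the root of each deleted tree. A child whose
// ancestor is already tracked is covered by that entry, so it never
// competes for cache slots.
private Path lastUsedDir;   // hypothetical fast path for one hot dir

void markDeleted(Path dir) {
  if (!isUnderDeletedDir(dir)) {
    cache.put(dir, Boolean.TRUE);
  }
  lastUsedDir = dir;
}
{code}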
Really this comes down to: what tradeoff do we make w.r.t. calling
{{fs.delete()}} vs. {{path.getParent()}} and the related cache lookups. For
local-cluster HDFS, the delete call is low cost. For HTTP-based object store
protocols, reducing those delete calls matters the most.
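For illustration, a hypothetical call site gating the deletes through the
tracker; {{targetsMissingFromSource}} and the tracker type are assumptions
carried over from the sketch above, not names from the patch.
{code}
// Each skipped fs.delete() is traded for a few in-memory getParent()
// walks and map lookups: cheap either way on HDFS, a big win against a
// throttled object store.
void deleteMissing(FileSystem fs, List<FileStatus> targetsMissingFromSource,
    LruDeletedDirSet tracker) throws IOException {
  for (FileStatus status : targetsMissingFromSource) {
    Path path = status.getPath();
    if (tracker.isUnderDeletedDir(path)) {
      continue;                       // parent already gone: skip the RPC
    }
    if (fs.delete(path, true) && status.isDirectory()) {
      tracker.markDeleted(path);      // remember the new deleted tree root
    }
  }
}
{code}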
> PoC: DistCp to eliminate needless deletion of files under deleted directories
> -----------------------------------------------------------------------------
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch
>
>
> DistCp issues a delete(file) request even if it is underneath an already deleted
> directory. This generates needless load on filesystems/object stores, and, if
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories,
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not
> overload the heap of the process.