[ 
https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349819#comment-16349819
 ] 

Steve Loughran commented on HADOOP-15191:
-----------------------------------------

Trace of a run. I'd expected the missing files to be queued for bulk deletion, 
which they are, but lots of directory deletions kick off too. This means the 
bulk ops aren't needed, and the attempt to be clever there and create parent 
dirs is wasted. 
{code}
2018-02-01 21:44:44,756 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(343)) - -delete option is enabled. About to 
remove entries from target that are missing in source
2018-02-01 21:44:46,064 [Thread-124] INFO  tools.SimpleCopyListing 
(SimpleCopyListing.java:printStats(608)) - Paths (files+dirs) cnt = 20; dirCnt 
= 10
2018-02-01 21:44:46,064 [Thread-124] INFO  tools.SimpleCopyListing 
(SimpleCopyListing.java:doBuildListing(402)) - Build file listing completed.
2018-02-01 21:44:46,080 [Thread-124] INFO  tools.DistCp 
(CopyListing.java:buildListing(94)) - Number of paths in the copy list: 20
2018-02-01 21:44:46,095 [Thread-124] INFO  tools.DistCp 
(CopyListing.java:buildListing(94)) - Number of paths in the copy list: 20
2018-02-01 21:44:46,109 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(385)) - Listing completed in 0:00:01.352
2018-02-01 21:44:46,109 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(405)) - Destination filesystem supports bulk 
deletes, maximum size 2
2018-02-01 21:44:46,390 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir
 - Missing at source
2018-02-01 21:44:46,390 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/file1
2018-02-01 21:44:46,507 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir1
 - Missing at source
2018-02-01 21:44:46,508 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir1/file2
2018-02-01 21:44:46,508 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:46,512 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:46,596 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try 
creating: 1
2018-02-01 21:44:46,784 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created 
directories: 1 
2018-02-01 21:44:46,923 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2
 - Missing at source
2018-02-01 21:44:48,013 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3
 - Missing at source
2018-02-01 21:44:48,014 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/file3
2018-02-01 21:44:48,014 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/file4
2018-02-01 21:44:48,014 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:48,014 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:48,168 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try 
creating: 1
2018-02-01 21:44:48,461 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created 
directories: 1 
2018-02-01 21:44:48,461 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/newfile1
2018-02-01 21:44:48,874 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4
 - Missing at source
2018-02-01 21:44:48,980 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(434)) - Deleted directory 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4
 - Missing at source
2018-02-01 21:44:48,980 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4/file4
2018-02-01 21:44:48,980 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:48,980 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:49,005 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try 
creating: 2
2018-02-01 21:44:49,281 [Thread-124] INFO  s3a.S3AFileSystem 
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created 
directories: 1 
2018-02-01 21:44:49,281 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4/file5
2018-02-01 21:44:49,282 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(467)) - Initiating final bulk delete of size 1
2018-02-01 21:44:49,426 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(476)) - Deleted from target: 
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir
 entries: files: 7 directories: 6
2018-02-01 21:44:49,427 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:deleteMissing(478)) - Time to delete: 0:00:03.317
2018-02-01 21:44:49,427 [Thread-124] INFO  mapred.CopyCommitter 
(CopyCommitter.java:cleanup(179)) - Cleaning up temporary work folder: 
file:/tmp/hadoop/mapred/staging/stevel368042071/.staging/_distcp445682291
2018-02-01 21:44:49,516 [Thread-0] INFO  mapreduce.Job 
(Job.java:monitorAndPrintJob(1658)) - Job job_local237798756_0002 completed 
successfully
{code}
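
One way to read the trace above: once a missing directory has been deleted, the files beneath it no longer need to be queued individually. The following is a minimal sketch of that pruning step, not code from the patch. It assumes that deleting a directory removes everything under it and that the listing is sorted so a parent always precedes its children. `DeletePruner` and `topMostOnly` are hypothetical names, and plain strings stand in for Hadoop `Path` objects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: keeps only the top-most missing entries. */
final class DeletePruner {

  /**
   * Returns the paths that still need an explicit delete.
   * Assumes parents appear before their children in the input.
   */
  static List<String> topMostOnly(List<String> sortedMissing) {
    Set<String> kept = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String path : sortedMissing) {
      // Skip anything already covered by a kept ancestor directory.
      if (!hasKeptAncestor(path, kept)) {
        kept.add(path);
        result.add(path);
      }
    }
    return result;
  }

  /** Walks up the path one '/' at a time, looking for a kept ancestor. */
  private static boolean hasKeptAncestor(String path, Set<String> kept) {
    int slash = path.lastIndexOf('/');
    while (slash > 0) {
      String parent = path.substring(0, slash);
      if (kept.contains(parent)) {
        return true;
      }
      slash = parent.lastIndexOf('/');
    }
    return false;
  }
}
```

Applied to a listing shaped like the one in the trace, only the top-level missing directory would survive, so no files would be left to queue for the bulk call.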

> Add Private/Unstable BulkDelete operations to supporting object stores for 
> DistCP
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-15191
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15191
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch, 
> HADOOP-15191-003.patch, HADOOP-15191-004.patch
>
>
> Large-scale DistCP with the -delete option doesn't finish in a viable time 
> because the final CopyCommitter does a one-by-one delete of all missing 
> files. This isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files were exposed as an API, DistCP would do 1/1000 of 
> the REST calls, and so would not get throttled.
> Proposed: add an initially private/unstable interface for stores, 
> {{BulkDelete}}, which declares a page size and offers a 
> {{bulkDelete(List<Path>)}} operation for the bulk deletion.
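
A minimal sketch of what the proposed interface and a paging caller might look like. Only the name `BulkDelete` and the `bulkDelete(List<Path>)` operation come from the description above. The method name `getBulkDeletePageSize`, the `int` return type, and the `PagedDeleter` helper are assumptions made for illustration, and plain strings stand in for Hadoop `Path` objects so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the proposed interface (signatures assumed). */
interface BulkDelete {
  /** Maximum number of paths the store accepts in one call. */
  int getBulkDeletePageSize();

  /** Deletes the given files; returns the number of paths submitted (assumed). */
  int bulkDelete(List<String> paths);
}

/** Hypothetical caller that splits a delete list into pages. */
final class PagedDeleter {

  /** Issues one bulkDelete call per page and returns the number of calls made. */
  static int deleteInPages(BulkDelete store, List<String> paths) {
    int pageSize = store.getBulkDeletePageSize();
    if (pageSize <= 0) {
      throw new IllegalArgumentException("page size must be positive");
    }
    int calls = 0;
    for (int start = 0; start < paths.size(); start += pageSize) {
      int end = Math.min(start + pageSize, paths.size());
      // Copy the page so the store never sees a view of the caller's list.
      store.bulkDelete(new ArrayList<>(paths.subList(start, end)));
      calls++;
    }
    return calls;
  }
}
```

With a page size of 2 and seven files, as in the trace above, this would make four calls: three full pages and a final page of one.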


