[
https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349819#comment-16349819
]
Steve Loughran commented on HADOOP-15191:
-----------------------------------------
Trace of a run. I'd expected the missing files to be queued for bulk deletion,
which they are, but lots of directory deletions kick off too. Each directory
delete already removes the files beneath it, so the bulk ops aren't needed,
and the attempt to be clever there and create parent dirs is wasted.
{code}
2018-02-01 21:44:44,756 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(343)) - -delete option is enabled. About to
remove entries from target that are missing in source
2018-02-01 21:44:46,064 [Thread-124] INFO tools.SimpleCopyListing
(SimpleCopyListing.java:printStats(608)) - Paths (files+dirs) cnt = 20; dirCnt
= 10
2018-02-01 21:44:46,064 [Thread-124] INFO tools.SimpleCopyListing
(SimpleCopyListing.java:doBuildListing(402)) - Build file listing completed.
2018-02-01 21:44:46,080 [Thread-124] INFO tools.DistCp
(CopyListing.java:buildListing(94)) - Number of paths in the copy list: 20
2018-02-01 21:44:46,095 [Thread-124] INFO tools.DistCp
(CopyListing.java:buildListing(94)) - Number of paths in the copy list: 20
2018-02-01 21:44:46,109 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(385)) - Listing completed in 0:00:01.352
2018-02-01 21:44:46,109 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(405)) - Destination filesystem supports bulk
deletes, maximum size 2
2018-02-01 21:44:46,390 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir
- Missing at source
2018-02-01 21:44:46,390 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/file1
2018-02-01 21:44:46,507 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir1
- Missing at source
2018-02-01 21:44:46,508 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir1/file2
2018-02-01 21:44:46,508 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:46,512 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:46,596 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try
creating: 1
2018-02-01 21:44:46,784 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created
directories: 1
2018-02-01 21:44:46,923 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2
- Missing at source
2018-02-01 21:44:48,013 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3
- Missing at source
2018-02-01 21:44:48,014 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/file3
2018-02-01 21:44:48,014 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/file4
2018-02-01 21:44:48,014 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:48,014 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:48,168 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try
creating: 1
2018-02-01 21:44:48,461 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created
directories: 1
2018-02-01 21:44:48,461 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir2/subDir3/newfile1
2018-02-01 21:44:48,874 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4
- Missing at source
2018-02-01 21:44:48,980 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(434)) - Deleted directory
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4
- Missing at source
2018-02-01 21:44:48,980 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4/file4
2018-02-01 21:44:48,980 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(450)) - Initiating bulk delete of size 2
2018-02-01 21:44:48,980 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:lambda$bulkDeleteFiles$0(157)) - Deleting 2 objects
2018-02-01 21:44:49,005 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(228)) - Number of directories to try
creating: 2
2018-02-01 21:44:49,281 [Thread-124] INFO s3a.S3AFileSystem
(S3ABulkOperations.java:maybeMkParentDirs(237)) - Number of created
directories: 1
2018-02-01 21:44:49,281 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(446)) - Queueing for bulk delete file
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir/inputDir/subDir4/subDir4/file5
2018-02-01 21:44:49,282 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(467)) - Initiating final bulk delete of size 1
2018-02-01 21:44:49,426 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(476)) - Deleted from target:
s3a://hwdev-steve-new/test/ITestS3AContractDistCp/deepDirectoryStructureToRemoteWithSync/outputDir
entries: files: 7 directories: 6
2018-02-01 21:44:49,427 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:deleteMissing(478)) - Time to delete: 0:00:03.317
2018-02-01 21:44:49,427 [Thread-124] INFO mapred.CopyCommitter
(CopyCommitter.java:cleanup(179)) - Cleaning up temporary work folder:
file:/tmp/hadoop/mapred/staging/stevel368042071/.staging/_distcp445682291
2018-02-01 21:44:49,516 [Thread-0] INFO mapreduce.Job
(Job.java:monitorAndPrintJob(1658)) - Job job_local237798756_0002 completed
successfully
{code}
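One way to avoid the redundant work seen in the trace above is to skip any entry that sits under a directory already deleted in the same pass. The following is only a sketch of that idea, not the code in the attached patches: {{DeletePruning}}, {{Entry}} and {{prune}} are hypothetical names, paths are plain strings rather than Hadoop {{Path}} objects so it compiles without Hadoop on the classpath, and it assumes the listing is sorted so that a directory always appears before its children.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DeletePruning {

    /** Hypothetical stand-in for a target-side listing entry. */
    static final class Entry {
        final String path;
        final boolean isDirectory;

        Entry(String path, boolean isDirectory) {
            this.path = path;
            this.isDirectory = isDirectory;
        }
    }

    /** True if any ancestor of {@code path} is in {@code deletedDirs}. */
    static boolean hasDeletedAncestor(String path, Set<String> deletedDirs) {
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            String parent = path.substring(0, slash);
            if (deletedDirs.contains(parent)) {
                return true;
            }
            slash = parent.lastIndexOf('/');
        }
        return false;
    }

    /**
     * Returns the entries that still need an explicit delete.
     * Assumes parents precede children in {@code missing}.
     */
    static List<Entry> prune(List<Entry> missing) {
        Set<String> deletedDirs = new HashSet<>();
        List<Entry> toDelete = new ArrayList<>();
        for (Entry e : missing) {
            if (hasDeletedAncestor(e.path, deletedDirs)) {
                continue; // a recursive delete of an ancestor covers this entry
            }
            toDelete.add(e);
            if (e.isDirectory) {
                deletedDirs.add(e.path);
            }
        }
        return toDelete;
    }
}
```

Applied to a listing shaped like the one in the trace, only the top-level {{inputDir}} would survive pruning, so neither the per-file queueing nor the parent directory re-creation would be triggered for its children.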
> Add Private/Unstable BulkDelete operations to supporting object stores for
> DistCP
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-15191
> URL: https://issues.apache.org/jira/browse/HADOOP-15191
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3, tools/distcp
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch,
> HADOOP-15191-003.patch, HADOOP-15191-004.patch
>
>
> Large scale DistCP with the -delete option doesn't finish in a viable time
> because the final CopyCommitter deletes all missing files one by one.
> This isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files were exposed as an API, DistCP would make 1/1000
> of the REST calls and so would not get throttled.
> Proposed: add an initially private/unstable interface for stores,
> {{BulkDelete}} which declares a page size and offers a
> {{bulkDelete(List<Path>)}} operation for the bulk deletion.
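To make the proposal concrete, here is a minimal sketch of what such an interface and a paging caller might look like. Only the {{BulkDelete}} name and the {{bulkDelete(List<Path>)}} operation come from the description above; the {{getBulkDeletePageSize()}} method name, the {{PagedDeleter}} helper and the generic parameter {{P}} (standing in for Hadoop's {{Path}} so the sketch compiles on its own) are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the proposed interface; P stands in for Hadoop's Path. */
interface BulkDelete<P> {

    /** Maximum number of paths per call (hypothetical method name). */
    int getBulkDeletePageSize();

    /** Delete all the given paths in a single store request. */
    void bulkDelete(List<P> paths) throws IOException;
}

/** Hypothetical helper that splits a delete list into pages. */
class PagedDeleter {

    /** Deletes every path, one page at a time; returns the number of calls made. */
    static <P> int deleteAll(BulkDelete<P> store, List<P> paths) throws IOException {
        int pageSize = store.getBulkDeletePageSize();
        if (pageSize <= 0) {
            throw new IllegalArgumentException("page size must be positive");
        }
        int calls = 0;
        for (int start = 0; start < paths.size(); start += pageSize) {
            int end = Math.min(start + pageSize, paths.size());
            // Copy the sublist so the store never sees a view of the caller's list.
            store.bulkDelete(new ArrayList<>(paths.subList(start, end)));
            calls++;
        }
        return calls;
    }
}
```

With a page size of 2 and 7 files, as in the trace earlier in this message, this makes 4 calls (three full pages and a final page of one). With a page size of 1000, the call count drops to roughly 1/1000 of the one-by-one approach, which is the saving the description refers to.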