[
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-13208:
------------------------------------
Status: Patch Available (was: Open)
Tested against S3 Ireland. With the modified test, I'm seeing a speedup of more
than 20x (about 28x, going by the durations logged below) for
listing the specific directory tree created for this test:
{code}
2016-07-13 14:56:42,370 [Thread-8] INFO contract.ContractTestUtils
(ContractTestUtils.java:end(1362)) - Duration of List status via treewalk of 48
directories and 6 files: 5,964,773,672 nS
2016-07-13 14:56:42,370 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_metadata_requests starting=444
current=542 diff=98
2016-07-13 14:56:42,370 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_list_requests starting=132 current=184
diff=52
2016-07-13 14:56:42,370 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_continue_list_requests starting=0
current=0 diff=0
2016-07-13 14:56:42,370 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - op_list_files starting=0 current=0 diff=0
2016-07-13 14:56:42,370 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - op_get_file_status starting=225 current=274
diff=49
2016-07-13 14:56:42,371 [Thread-8] INFO scale.S3AScaleTestBase
(S3AScaleTestBase.java:describe(155)) -
testListOperations: Listing files via listFiles(recursive=true)
2016-07-13 14:56:42,579 [Thread-8] INFO contract.ContractTestUtils
(ContractTestUtils.java:end(1362)) - Duration of listFiles(recursive=true) of
48 directories and 6 files: 208,461,316 nS
2016-07-13 14:56:42,580 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_metadata_requests starting=542
current=544 diff=2
2016-07-13 14:56:42,580 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_list_requests starting=184 current=186
diff=2
2016-07-13 14:56:42,580 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - object_continue_list_requests starting=0
current=0 diff=0
2016-07-13 14:56:42,580 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - op_list_files starting=0 current=1 diff=1
2016-07-13 14:56:42,580 [Thread-8] INFO scale.TestS3ADirectoryPerformance
(S3ATestUtils.java:print(212)) - op_get_file_status starting=274 current=275
diff=1
2016-07-13 14:56:42,580 [Thread-8] INFO scale.S3AScaleTestBase
(S3AScaleTestBase.java:describe(155)) -
{code}
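Obviously, that could be unrepresentative, but it does show the speedup we can
obtain: the treewalk took 98 metadata requests and 52 list requests, while
listFiles(recursive=true) took 2 of each.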
> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the
> pseudo-tree of directories
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch,
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch,
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch,
> HADOOP-13208-branch-2-011.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out to be listing
> the directory tree itself. That's because against S3, it takes S3A two HEADs
> and two lists to list the content of any directory path (2 HEADs + 1 list for
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two
> listings. However, a listing of a directory tree will still be
> O(directories). In contrast, a recursive {{listFiles()}} operation should be
> implementable by a bulk listing of all descendant paths; one List operation
> per thousand descendants.
> As the result of this call is an iterator, the ongoing listing can be
> implemented within the iterator itself.
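
The idea in the description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the S3A patch itself: FlatStore, listPage() and BulkListingIterator are hypothetical names, the store is an in-memory sorted set standing in for S3, and a short page is used as the end-of-listing signal where a real client would check the truncation flag in the response. It shows one flat, paged listing of every key under a prefix, with the next page fetched lazily inside the iterator and directory-marker keys (those ending in "/") skipped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for an object store with a flat, sorted key space. */
class FlatStore {
    private final TreeSet<String> keys = new TreeSet<>();
    int listRequests = 0; // number of simulated LIST calls issued so far

    void put(String key) {
        keys.add(key);
    }

    /** One simulated LIST call: up to pageSize keys under prefix, strictly after marker (null = start). */
    List<String> listPage(String prefix, String marker, int pageSize) {
        listRequests++;
        Iterable<String> candidates =
            (marker == null) ? keys.tailSet(prefix, true) : keys.tailSet(marker, false);
        List<String> page = new ArrayList<>();
        for (String key : candidates) {
            // Keys sharing a prefix are contiguous in sorted order, so stop at the first mismatch.
            if (!key.startsWith(prefix) || page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }
}

/** Iterates over all file keys under a prefix, fetching one page at a time on demand. */
class BulkListingIterator implements Iterator<String> {
    private final FlatStore store;
    private final String prefix;
    private final int pageSize;
    private Iterator<String> page = Collections.emptyIterator();
    private String marker = null;      // last key seen; the next page starts after it
    private boolean exhausted = false; // true once the store has returned a short page

    BulkListingIterator(FlatStore store, String prefix, int pageSize) {
        this.store = store;
        this.prefix = prefix;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        // Loop because a page may contain only directory markers and so yield no files.
        while (!page.hasNext() && !exhausted) {
            List<String> raw = store.listPage(prefix, marker, pageSize);
            exhausted = raw.size() < pageSize; // simplification: short page means end of listing
            if (!raw.isEmpty()) {
                marker = raw.get(raw.size() - 1);
            }
            List<String> files = new ArrayList<>();
            for (String key : raw) {
                if (!key.endsWith("/")) { // skip directory-marker keys
                    files.add(key);
                }
            }
            page = files.iterator();
        }
        return page.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.next();
    }
}

class BulkListingDemo {
    public static void main(String[] args) {
        FlatStore store = new FlatStore();
        for (int i = 0; i < 2500; i++) {
            store.put("data/dir" + (i % 48) + "/file" + i); // files spread over 48 pseudo-directories
        }
        store.put("data/empty/"); // a directory marker, not a file
        Iterator<String> files = new BulkListingIterator(store, "data/", 1000);
        int count = 0;
        while (files.hasNext()) {
            files.next();
            count++;
        }
        System.out.println(count + " files, " + store.listRequests + " list requests");
    }
}
```

With 2500 files under 48 pseudo-directories and a page size of 1000, the demo issues 3 list requests in total, independent of the number of directories; a treewalk over the same tree would need at least one listing per directory.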