[
https://issues.apache.org/jira/browse/HADOOP-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515328#comment-16515328
]
Thomas Marquardt commented on HADOOP-15547:
-------------------------------------------
Attaching patch HADOOP-15547.001.patch. This reduces the time to list 700,000
files in a single directory by 10x. Changes include:
1) Instead of an O(n^2) algorithm to remove duplicates, a HashMap is used,
giving an O(1) lookup per entry. This reduces the time to list 700,000 files
to less than 3 minutes, compared to over 30 minutes without the fix, against
my US West storage account.
2) Avoid an extra copy of the metadata by having FileMetadata inherit from
FileStatus.
3) Simplify the list-related code, which had a couple of parameters (delimiter
and priorLastKey) that were never used.
4) Add a scale test (ITestListPerformance) to validate the performance of
listing 700k files. Because it is a scale test, it is skipped unless you set
fs.azure.scale.test.enabled to true.
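
To illustrate change 1, here is a minimal sketch of the difference between
removing duplicates by rescanning a list and removing them with a hash map.
This is not the code from the patch: the Entry class, the method names, and
the choice to keep the first occurrence of each key are all assumptions made
for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ListingDedup {

    /** Hypothetical stand-in for a listing entry; not the real FileMetadata. */
    static final class Entry {
        final String key;
        final boolean isDirectory;

        Entry(String key, boolean isDirectory) {
            this.key = key;
            this.isDirectory = isDirectory;
        }
    }

    /** Quadratic approach: every entry rescans the results collected so far. */
    static List<Entry> dedupByScan(List<Entry> entries) {
        List<Entry> result = new ArrayList<>();
        for (Entry candidate : entries) {
            boolean seen = false;
            for (Entry kept : result) {
                if (kept.key.equals(candidate.key)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Linear approach: one expected O(1) hash lookup per entry. */
    static List<Entry> dedupByMap(List<Entry> entries) {
        // LinkedHashMap preserves insertion order, so the output order
        // matches dedupByScan. Keeping the first occurrence is an assumption.
        Map<String, Entry> byKey = new LinkedHashMap<>();
        for (Entry candidate : entries) {
            byKey.putIfAbsent(candidate.key, candidate);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

Both methods return the same entries in the same order. For n entries the
scan performs on the order of n^2 key comparisons in the worst case, while
the map version performs n hash lookups, which is where the savings at
700,000 entries would come from.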
All tests are passing against my US West storage account:
Tests run: 241, Failures: 0, Errors: 0, Skipped: 11
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Tests run: 446, Failures: 0, Errors: 0, Skipped: 55
Tests run: 126, Failures: 0, Errors: 0, Skipped: 10
> WASB: listStatus performance
> ----------------------------
>
> Key: HADOOP-15547
> URL: https://issues.apache.org/jira/browse/HADOOP-15547
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/azure
> Affects Versions: 2.9.1, 3.0.2
> Reporter: Thomas Marquardt
> Assignee: Thomas Marquardt
> Priority: Major
> Attachments: HADOOP-15547.001.patch
>
>
> The WASB implementation of FileSystem.listStatus is very slow due to an O(n^2)
> algorithm to remove duplicates, and it uses too much memory due to the extra
> conversion from BlobListItem to FileMetadata to FileStatus. It takes over 30
> minutes to list 700,000 files.