[ 
https://issues.apache.org/jira/browse/HADOOP-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515328#comment-16515328
 ] 

Thomas Marquardt commented on HADOOP-15547:
-------------------------------------------

Attaching patch HADOOP-15547.001.patch.  This reduces the time to list 700,000 
files in a single directory by 10x.  Changes include:

1) Instead of an O(n!) algorithm to remove duplicates, a HashMap is used, 
giving an O(1) lookup per entry.  This reduces the time to list 700,000 files 
to less than 3 minutes, compared to over 30 minutes without the fix, measured 
against my US West storage account.

2) Avoid an extra copy of the metadata by having FileMetadata inherit from 
FileStatus.

3) Simplify the list-related code, which had a couple of parameters (delimiter 
and priorLastKey) that were never used.

4) Add a scale test (ITestListPerformance) to validate the performance of 
listing 700k files.  Because it is a scale test, it is skipped unless you set 
fs.azure.scale.test.enabled to true.
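
To illustrate the idea behind change 1, here is a minimal sketch of hash-based 
duplicate removal.  It is not the code from the patch: the ListingDedup and 
Entry names are hypothetical, and keeping the first occurrence of each key is 
an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the WASB classes touched by the patch.
class ListingDedup {

    // Stand-in for one listing entry (assumed shape: a key plus a length).
    static final class Entry {
        final String key;
        final long length;

        Entry(String key, long length) {
            this.key = key;
            this.length = length;
        }
    }

    // Removes entries with duplicate keys using one hash lookup per entry,
    // so the whole pass is expected O(n) rather than rescanning the results
    // for every new entry.  Assumption: the first entry seen for a key wins.
    static List<Entry> dedupByMap(List<Entry> entries) {
        // LinkedHashMap keeps insertion order, so listing order is preserved.
        Map<String, Entry> byKey = new LinkedHashMap<>();
        for (Entry e : entries) {
            byKey.putIfAbsent(e.key, e);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

With this shape, the cost of each duplicate check does not grow with the 
number of entries already collected, which is what matters at 700,000 files.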

 

All tests are passing against my US West storage account:

Tests run: 241, Failures: 0, Errors: 0, Skipped: 11
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Tests run: 446, Failures: 0, Errors: 0, Skipped: 55
Tests run: 126, Failures: 0, Errors: 0, Skipped: 10

 

> WASB: listStatus performance
> ----------------------------
>
>                 Key: HADOOP-15547
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15547
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/azure
>    Affects Versions: 2.9.1, 3.0.2
>            Reporter: Thomas Marquardt
>            Assignee: Thomas Marquardt
>            Priority: Major
>         Attachments: HADOOP-15547.001.patch
>
>
> The WASB implementation of Filesystem.listStatus is very slow due to an O(n!) 
> algorithm to remove duplicates, and it uses too much memory due to the extra 
> conversion from BlobListItem to FileMetadata to FileStatus.  It takes over 30 
> minutes to list 700,000 files.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
