[ 
https://issues.apache.org/jira/browse/HADOOP-16465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895374#comment-16895374
 ] 

Steve Loughran commented on HADOOP-16465:
-----------------------------------------

This is potentially a major speedup in our treewalkage: although there will 
still be one LIST per directory, that is the only request, rather than the 
HEAD, HEAD, LIST, LIST sequence we issue today. That is, we could cut the 
number of HTTP requests to a quarter during the directory scan that takes 
place before a query.
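
To make the request arithmetic concrete, here is a minimal sketch against a 
toy in-memory store. It is not the S3A code: FakeStore, listProbeFirst and 
listFirst are hypothetical names invented for illustration, and the model 
assumes a LIST does not return the directory marker object itself.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of probe-first vs. list-first; not the S3A implementation. */
public class ListFirstSketch {

    /** Toy object store that counts every simulated HTTP request. */
    static class FakeStore {
        private final TreeSet<String> keys = new TreeSet<>();
        private int requestCount = 0;

        void put(String key) { keys.add(key); }

        /** Simulated HEAD: does an object with exactly this key exist? */
        boolean head(String key) {
            requestCount++;
            return keys.contains(key);
        }

        /** Simulated LIST: keys under the prefix, skipping the marker key itself (assumption). */
        List<String> list(String prefix) {
            requestCount++;
            List<String> out = new ArrayList<>();
            for (String k : keys.tailSet(prefix)) {
                if (!k.startsWith(prefix)) break;   // sorted set: prefix matches are contiguous
                if (!k.equals(prefix)) out.add(k);
            }
            return out;
        }

        int requests() { return requestCount; }
    }

    private static UncheckedIOException notFound(String path) {
        return new UncheckedIOException(new FileNotFoundException(path));
    }

    /** Today's sequence for an unmarked directory: HEAD, HEAD, probe LIST, real LIST. */
    static List<String> listProbeFirst(FakeStore store, String path) {
        if (store.head(path)) {
            return Collections.singletonList(path);       // path is a file
        }
        boolean marker = store.head(path + "/");          // directory marker?
        if (!marker && store.list(path + "/").isEmpty()) { // probe LIST
            throw notFound(path);
        }
        return store.list(path + "/");                    // the real listing
    }

    /** Proposed sequence: LIST first, fall back to the HEAD probes only if it is empty. */
    static List<String> listFirst(FakeStore store, String path) {
        List<String> children = store.list(path + "/");
        if (!children.isEmpty()) {
            return children;                              // non-empty directory: one request
        }
        if (store.head(path)) {
            return Collections.singletonList(path);       // path is a file
        }
        if (store.head(path + "/")) {
            return Collections.emptyList();               // empty directory with a marker
        }
        throw notFound(path);
    }

    public static void main(String[] args) {
        FakeStore a = new FakeStore();
        FakeStore b = new FakeStore();
        for (FakeStore s : new FakeStore[] {a, b}) {
            s.put("table/part-0000");
            s.put("table/part-0001");
        }
        listProbeFirst(a, "table");
        listFirst(b, "table");
        System.out.println("probe-first requests: " + a.requests());
        System.out.println("list-first requests:  " + b.requests());
    }
}
```

In this toy model the non-empty directory costs four requests probe-first 
and one request list-first; the price is that a path which turns out to be a 
file now costs a wasted LIST before the HEAD.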



> S3AFileSystem.listLocatedStatus to LIST before HEAD
> ---------------------------------------------------
>
>                 Key: HADOOP-16465
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16465
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Looking at logs of LocatedFileStatus/FileInputFormat scans, there's a 
> needless call to getFileStatus whenever an S3AFileSystem.listLocatedStatus() 
> call is made.
> # {{S3AFileSystem.listLocatedStatus()}} does a getFileStatus call and 
> returns the file status first
> # But if you look at all the uses in the MR code in FileInputFormat and 
> LocatedFileStatusFetcher, they only call this method *knowing the destination 
> is a directory*
> Which means for every unguarded S3 path: two needless HEADs and a 
> single-entry LIST, before the real LIST is initiated.
> If the S3A FS can assume that a dest is a non-empty directory, then it can go 
> straight to the LIST operation, only falling back to the HEAD + HEAD +/ if 
> that fails.
> We could also think about doing the same for listStatus.


