Steve Loughran created HADOOP-16465:
---------------------------------------
Summary: S3AFileSystem.listLocatedStatus to LIST before HEAD
Key: HADOOP-16465
URL: https://issues.apache.org/jira/browse/HADOOP-16465
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/s3
Affects Versions: 3.2.0
Reporter: Steve Loughran
Looking at logs of LocatedFileStatus/FileInputFormat scans, there's a needless
call to getFileStatus whenever an S3AFileSystem.listLocatedStatus() call is made:
# {{S3AFileSystem.listLocatedStatus()}} does a getFileStatus call first, to get
the file status of the path, before it lists anything
# But if you look at all the uses in the MR code in FileInputFormat and
LocatedFileStatusFetcher, they only call this method *knowing the destination
is a directory*
Which means that for every unguarded (non-S3Guard) S3 path there are two
needless HEADs and a single-entry LIST before the real LIST is initiated.
If the S3A FS can assume that the destination is a non-empty directory, then it
can go straight to the LIST operation, falling back to the HEAD on the path and
the HEAD on the path + "/" only if that LIST finds nothing.
We could also think about doing the same for listStatus.
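
A minimal sketch of the LIST-before-HEAD ordering proposed above, to make the request counts concrete. This is not the S3A code: the class {{ListBeforeHead}}, its in-memory key set, the request counters and the methods {{probeThenList}} / {{listThenProbe}} are all hypothetical stand-ins for a bucket and for the two orderings, and the fallback is simplified.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a bucket that counts requests; NOT the S3A code. */
class ListBeforeHead {
    final TreeSet<String> keys = new TreeSet<>();
    int heads = 0;  // number of simulated HEAD requests
    int lists = 0;  // number of simulated LIST requests

    /** Simulated HEAD: true if an object exists at exactly this key. */
    boolean head(String key) {
        heads++;
        return keys.contains(key);
    }

    /** Simulated LIST: up to maxKeys keys that start with the prefix. */
    List<String> list(String prefix, int maxKeys) {
        lists++;
        List<String> out = new ArrayList<>();
        for (String k : keys.tailSet(prefix)) {
            if (!k.startsWith(prefix) || out.size() >= maxKeys) {
                break;
            }
            out.add(k);
        }
        return out;
    }

    /** Drops the directory marker (the key equal to "dir/") from a listing. */
    static List<String> children(List<String> listed, String dir) {
        List<String> out = new ArrayList<>(listed);
        out.remove(dir);
        return out;
    }

    /** Probe first: HEAD path, HEAD path + "/", single-entry LIST, then the real LIST. */
    List<String> probeThenList(String path) {
        String dir = path + "/";
        if (head(path)) {
            return Collections.singletonList(path);  // the path is a file
        }
        if (!head(dir) && list(dir, 1).isEmpty()) {
            throw new IllegalArgumentException("not found: " + path);
        }
        return children(list(dir, Integer.MAX_VALUE), dir);
    }

    /** LIST first: assume a non-empty directory, probe with HEAD only if nothing is listed. */
    List<String> listThenProbe(String path) {
        String dir = path + "/";
        List<String> found = children(list(dir, Integer.MAX_VALUE), dir);
        if (!found.isEmpty()) {
            return found;                            // common case: a single LIST
        }
        if (head(path)) {
            return Collections.singletonList(path);  // the path is a file
        }
        if (head(dir)) {
            return found;                            // empty directory marker
        }
        throw new IllegalArgumentException("not found: " + path);
    }

    public static void main(String[] args) {
        ListBeforeHead store = new ListBeforeHead();
        store.keys.add("data/part-0000");
        store.keys.add("data/part-0001");

        store.probeThenList("data");
        System.out.println("probe-first: " + store.heads + " HEAD, " + store.lists + " LIST");

        store.heads = 0;
        store.lists = 0;
        store.listThenProbe("data");
        System.out.println("list-first:  " + store.heads + " HEAD, " + store.lists + " LIST");
    }
}
```

For a non-empty directory without a marker object, the probe-first path in this sketch issues 2 HEADs and 2 LISTs, while the LIST-first path issues a single LIST. The cost moves to the less common cases: a file or an empty directory now pays for one LIST that returns nothing before the HEAD probes run.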