[ https://issues.apache.org/jira/browse/HADOOP-16465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-16465.
-------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

+1, merged to trunk. Mukund, if you want to apply this to branch-3.3 (which I think we should), cherry-pick and test, and I'll merge it there too.

> Tune S3AFileSystem.listLocatedStatus
> ------------------------------------
>
>                 Key: HADOOP-16465
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16465
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Assignee: Mukund Thakur
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Looking at logs of LocatedFileStatus/FileInputFormat scans, there's a
> needless call to getFileStatus whenever an S3AFileSystem.listLocatedStatus()
> call is made.
> # {{S3AFileSystem.listLocatedStatus()}} does a getFileStatus call and returns
> the file status first.
> # But if you look at all the uses in the MR code in FileInputFormat and
> LocatedFileStatusFetcher, they only call this method *knowing the destination
> is a directory*.
> Which means for every unguarded S3 path: two needless HEADs and a single
> entry LIST, before the real LIST is initiated.
> If the S3A FS can assume that a dest is a non-empty directory, then it can go
> straight to the LIST operation, only falling back to the HEAD + HEAD +/ if
> that fails.
> We could also think about doing the same for listStatus.
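
The request-count argument in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the S3A code that was merged: `FakeStore`, `probeThenList`, `listFirst` and `sample` are hypothetical names, the store is an in-memory key set that counts simulated HEAD and LIST requests, and a missing path is reported with an unchecked wrapper around `FileNotFoundException` to keep the example short.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the two listing strategies; NOT the real S3AFileSystem code. */
public class ListFirstSketch {

    /** Hypothetical in-memory stand-in for a bucket that counts requests. */
    static class FakeStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int heads;
        int lists;

        void put(String key) { keys.add(key); }

        /** Simulated HEAD: does this exact key exist? */
        boolean head(String key) {
            heads++;
            return keys.contains(key);
        }

        /** Simulated LIST: up to maxKeys keys under prefix, excluding the prefix key itself. */
        List<String> list(String prefix, int maxKeys) {
            lists++;
            List<String> out = new ArrayList<>();
            for (String k : keys.tailSet(prefix, true)) {
                if (!k.startsWith(prefix) || out.size() >= maxKeys) break;
                if (!k.equals(prefix)) out.add(k);
            }
            return out;
        }
    }

    /** Probe first (HEAD path, HEAD path/, single-entry LIST), then do the real LIST. */
    static List<String> probeThenList(FakeStore s, String path) {
        if (s.head(path)) return List.of(path);            // path is a file
        String dir = path + "/";
        if (!s.head(dir) && s.list(dir, 1).isEmpty()) {     // no marker, no children
            throw new UncheckedIOException(new FileNotFoundException(path));
        }
        return s.list(dir, Integer.MAX_VALUE);              // the real LIST
    }

    /** Assume a non-empty directory: LIST first, fall back to HEADs only if it is empty. */
    static List<String> listFirst(FakeStore s, String path) {
        String dir = path + "/";
        List<String> children = s.list(dir, Integer.MAX_VALUE);
        if (!children.isEmpty()) return children;           // common case: one LIST
        if (s.head(path)) return List.of(path);             // path is a file
        if (s.head(dir)) return children;                   // empty directory marker
        throw new UncheckedIOException(new FileNotFoundException(path));
    }

    /** Builds a store holding one directory, "data", with two objects and no marker. */
    static FakeStore sample() {
        FakeStore s = new FakeStore();
        s.put("data/part-0000");
        s.put("data/part-0001");
        return s;
    }

    public static void main(String[] args) {
        FakeStore a = sample();
        probeThenList(a, "data");
        System.out.println("probe-then-list: " + a.heads + " HEAD, " + a.lists + " LIST");

        FakeStore b = sample();
        listFirst(b, "data");
        System.out.println("list-first: " + b.heads + " HEAD, " + b.lists + " LIST");
    }
}
```

In this model, listing the non-empty directory `data` costs 2 HEADs and 2 LISTs with the probe-first strategy and a single LIST with the list-first strategy. The HEAD requests are only paid when the LIST comes back empty, that is, when the path is a file, an empty directory, or missing.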