[ 
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375227#comment-15375227
 ] 

Steve Loughran commented on HADOOP-13371:
-----------------------------------------

An S3A-specific {{globStatus}} would deliver a significant speedup for any 
glob which scans down more than one level of the path. 

It could also drop the code in {{Globber}} which deals with Windows paths, 
along with some HDFS-related symlink and invisible ".snapshot" file code, code 
which appears to increase the number of calls to {{getFileStatus()}}.

However, {{Globber}} is a package-private class which throws everything into 
one place, has its tests in hadoop-hdfs, and isn't set up for 
subclassing/overriding.

Any object-store-specific globber would first have to rework {{Globber}} for 
subclassing (or make its own copy entirely). Copy-and-expand is simplest, 
albeit at the risk of a higher maintenance cost. All tests which stress it 
(e.g. the HDFS ones) would need to be replicated or abstracted up into some FS 
contract tests.
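
A rough sketch of the list + filter approach from the issue description, to 
show the shape of it. This is not S3A or {{Globber}} code: the class and 
method names are hypothetical, an in-memory list of keys stands in for the 
bulk object listing, and the JDK's glob matcher is used in place of Hadoop's 
own glob pattern handling, whose syntax differs in places.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListThenFilterGlob {

    /** Literal part of the glob up to the last '/' before the first wildcard. */
    static String literalPrefix(String glob) {
        int w = 0;
        while (w < glob.length() && "*?[{\\".indexOf(glob.charAt(w)) < 0) {
            w++;
        }
        if (w == glob.length()) {
            return glob; // no wildcard at all
        }
        return glob.substring(0, glob.lastIndexOf('/', w) + 1);
    }

    /** Hypothetical stand-in for one flat, bulk listing of all keys under a prefix. */
    static List<String> listAllKeys(List<String> store, String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : store) {
            if (key.startsWith(prefix)) {
                keys.add(key);
            }
        }
        return keys;
    }

    /** List everything under the literal prefix once, then filter client-side. */
    static List<String> globKeys(List<String> store, String glob) {
        // Assumption: JDK glob syntax, used here only for illustration.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String key : listAllKeys(store, literalPrefix(glob))) {
            if (matcher.matches(Paths.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> store = Arrays.asList(
            "logs/2016/07/a.gz", "logs/2016/07/b.txt",
            "logs/2016/08/c.gz", "logs/2015/01/d.gz", "data/x.gz");
        System.out.println(globKeys(store, "logs/2016/*/*.gz"));
    }
}
```

The tradeoff is the one the issue description states: every key under the 
literal prefix is listed before any filtering takes place, so the cost depends 
on the number of files under that prefix rather than on the number of matching 
directories.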

> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in 
> {{FileSystem.listStatus}} calls, but doesn't do anything for 
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one 
> which does a selective recursive scan by pattern matching as it goes down, 
> filtering out those patterns which don't match. Cost is 
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the 
> filtered treewalk, but through a list + filter operation. This would be an 
> O(files) lookup *before any filtering took place*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
