[
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375227#comment-15375227
]
Steve Loughran commented on HADOOP-13371:
-----------------------------------------
An S3A-specific globStatus would deliver a significant speedup for any glob
which scans more than one level down the path.
It could also drop some of the logic in Globber needed to deal with Windows
paths, plus some HDFS-related symlink and invisible ".snapshot" file code, code
which appears to increase the number of calls to getFileStatus().
However, {{Globber}} is a package-private class which throws everything into
one place, has its tests in hadoop-hdfs, and isn't set up for
subclassing/overriding.
Any object-store-specific globber would first have to rework Globber for
subclassing (or make its own copy entirely). Copy-and-expand is simplest,
albeit at the risk of a higher maintenance cost. All tests which stress it
(e.g. the HDFS ones) would need to be replicated or abstracted up into some FS
contract tests.
> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
> Key: HADOOP-13371
> URL: https://issues.apache.org/jira/browse/HADOOP-13371
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in
> {{FileSystem.listStatus}} calls, but doesn't do anything for
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one
> which does a selective recursive scan by pattern matching as it goes down,
> filtering out those patterns which don't match. Cost is
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the
> filtered treewalk, but through a list + filter operation. This would be an
> O(files) lookup *before any filtering took place*.
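
A rough sketch of the list + filter idea from the description, not the S3A
implementation. It assumes the keys have already been fetched by one bulk
listing under the longest wildcard-free prefix of the pattern, and it uses
java.nio glob matching as a stand-in for Hadoop's own glob syntax, which
differs in places. The class and method names (FlatGlobSketch, listThenFilter,
literalPrefix) are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Globber, not S3AFileSystem.
class FlatGlobSketch {

    /**
     * Longest leading run of path elements with no glob metacharacters.
     * A bulk listing would be issued once under this prefix.
     */
    static String literalPrefix(String glob) {
        StringBuilder prefix = new StringBuilder();
        for (String element : glob.split("/")) {
            if (element.matches(".*[*?\\[{].*")) {
                break; // first element containing a wildcard ends the prefix
            }
            prefix.append(element).append('/');
        }
        return prefix.toString();
    }

    /**
     * Filters a flat listing of keys against the glob. Every key is
     * examined, so the cost is O(files) before any filtering happens.
     * Assumption: java.nio glob rules, where '*' does not cross '/'.
     */
    static List<String> listThenFilter(List<String> keys, String glob) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String key : keys) {
            if (matcher.matches(Paths.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the result of a single bulk listing call.
        List<String> keys = Arrays.asList(
            "data/2016/07/a.csv",
            "data/2016/08/b.csv",
            "data/2015/07/c.csv",
            "data/2016/07/sub/d.csv");
        String glob = "data/*/07/*.csv";
        System.out.println("list under: " + literalPrefix(glob));
        System.out.println("matches: " + listThenFilter(keys, glob));
    }
}
```

The trade-off is the one noted in the description: the treewalk pays per
matching directory, while this pays per object under the prefix, so it only
wins when the literal prefix is selective enough.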