[
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943051#comment-15943051
]
ASF GitHub Bot commented on HADOOP-13371:
-----------------------------------------
Github user steveloughran commented on the issue:
https://github.com/apache/hadoop/pull/204
What an s3a globber can do is skip the recursive treewalk of
pseudo-directories as it applies the pattern. Instead it could ask for all
children of a path, then do the filtering. Look at HADOOP-13208 for the details.
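
A rough sketch of the "list, then filter" idea, not code from the patch. The
class and method names are hypothetical, the flat key list stands in for
whatever a bulk listing of the path would return, and the glob handling is a
simplified stand-in (only `*` and `?`) rather than Hadoop's own glob syntax:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ListThenFilterSketch {

  /**
   * Convert a simplified glob into a regex. Only '*' and '?' are treated
   * as wildcards, and neither crosses a '/' boundary. This is an
   * illustration, not Hadoop's glob implementation.
   */
  static Pattern globToRegex(String glob) {
    StringBuilder sb = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*') {
        sb.append("[^/]*");
      } else if (c == '?') {
        sb.append("[^/]");
      } else {
        sb.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(sb.toString());
  }

  /** Filter an already-fetched flat listing of keys against the glob. */
  static List<String> listThenFilter(List<String> allKeys, String glob) {
    Pattern pattern = globToRegex(glob);
    List<String> matches = new ArrayList<>();
    for (String key : allKeys) {
      if (pattern.matcher(key).matches()) {
        matches.add(key);
      }
    }
    return matches;
  }
}
```

All the filtering here happens client-side after the listing has come back,
which is why the amount of data listed matters more than the pattern itself.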
This patch is the preamble: a copy of the existing globber, and the initial
performance test which would be the baseline for measuring speedups. Look at
some of the other dir operations to see how we measure that: not by elapsed
time, but by the number of HTTP requests made.
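
To illustrate measuring by request count rather than by time, here is a
minimal sketch. `CountingStore` is a hypothetical in-memory stand-in, not the
S3A client or its instrumentation; it just counts how many list calls a
recursive treewalk issues:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RequestCountSketch {

  /** Hypothetical object store stand-in which counts LIST calls. */
  static class CountingStore {
    private final Map<String, List<String>> children = new HashMap<>();
    private int listRequests = 0;

    /** Register a directory; child names ending in '/' are directories. */
    void addDir(String dir, String... names) {
      children.put(dir, Arrays.asList(names));
    }

    /** One simulated request per call, whatever it returns. */
    List<String> list(String dir) {
      listRequests++;
      List<String> found = children.get(dir);
      return found == null ? new ArrayList<String>() : found;
    }

    int getListRequests() {
      return listRequests;
    }
  }

  /** Recursive treewalk: one list call per directory; returns files seen. */
  static int treewalk(CountingStore store, String dir) {
    int files = 0;
    for (String child : store.list(dir)) {
      if (child.endsWith("/")) {
        files += treewalk(store, dir + child);
      } else {
        files++;
      }
    }
    return files;
  }
}
```

A baseline test would then assert on the difference in `getListRequests()`
before and after the operation, which stays stable across machines and
networks in a way elapsed time does not.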
I should warn that this is not trivial to do well. I put the work aside
once I had thought up some example directory layouts which would overload
a naive "ask for all then filter" scheme: any deep and wide tree where
the wildcard was near the top of the tree and very selective would end up
asking for far too much data, and discarding almost all of it. Listing is
complex; for now we're doing s3guard, which switches to DynamoDB for
consistency as well as performance. If you do want to improve client performance, this is
somewhere where we would all benefit from you getting involved. I don't see any
of us going near other bits of speedup until after s3guard is in, because
everything will be stamping on the same lines of code and making merging a
pain. And s3guard is probably going to obsolete a lot of the speedup proposed
here on any bucket which has the DDB backing tables.
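
The deep and wide tree problem above can be put into rough numbers. This is
a back-of-envelope cost model only, with hypothetical names. It assumes a
uniform tree, a glob with one selective wildcard at the top level and literal
path elements beneath it, one request per path element probed during the
treewalk (a simplification of what the real client does), and list responses
paged at 1000 keys, which is the S3 per-response limit:

```java
class GlobCostSketch {

  /** S3 list responses carry at most 1000 keys per page. */
  static final int PAGE_SIZE = 1000;

  /**
   * Pages needed to bulk-list the whole tree: width^depth leaf
   * directories, each holding filesPerLeaf files.
   */
  static long bulkListRequests(int width, int depth, int filesPerLeaf) {
    long leaves = 1;
    for (int i = 0; i < depth; i++) {
      leaves *= width;
    }
    long keys = leaves * filesPerLeaf;
    return (keys + PAGE_SIZE - 1) / PAGE_SIZE;
  }

  /**
   * Simplified treewalk cost: one list call to resolve the top-level
   * wildcard, then one probe per remaining path element for each
   * top-level directory which matched.
   */
  static long treewalkRequests(int matchingTop, int depth) {
    return 1 + (long) matchingTop * depth;
  }
}
```

With 100 subdirectories per level, three levels and 10 files per leaf, this
model gives 10,000 list pages for the bulk approach against 4 requests for
the treewalk when the wildcard matches a single top-level directory.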
> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
> Key: HADOOP-13371
> URL: https://issues.apache.org/jira/browse/HADOOP-13371
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in
> {{FileSystem.listStatus}} calls, but doesn't do anything for
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one
> which does a selective recursive scan by pattern matching as it goes down,
> filtering out those patterns which don't match. Cost is
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the
> filtered treewalk, but through a list + filter operation. This would be an
> O(files) lookup *before any filtering took place*.