[
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931314#comment-15931314
]
Steve Loughran commented on HADOOP-13371:
-----------------------------------------
I've hit the submit patch button; that'll apply it to the code.
However, we do have extra policy for the object store tests, related to
declaration of what happened to the S3 test runs:
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
Looking at the patch itself, it tries to do this by fixing the FileSystem
base classes. That's generally harder to do, as it runs a greater risk of
breaking things; it also makes backporting less likely.
# hadoop trunk should be the base hadoop version to work with; as long as you
keep use of Java 8 features to closures, stuff can be backported to branch-2.
Things like HADOOP-13519 will conflict with the new Filter field of Path, and
as I warned, that's not going to be allowed.
Where I think we can start is in S3A itself, with its own globber, which we can
optimise purely for S3A.
I started on that a while back; I'll push it up as a PR, even though it
currently doesn't do anything useful except add a new scale test intended to
look at glob performance.
Why did I stop working on it? I put it to one side as the complexity got too
much for me.
What I wanted to do was the full HADOOP-13208 recursive list status, that is
listFiles(path, true), picking up all the children which the glob pattern
matches. The problem is paths which get very wide very fast, but for which only
a small part of the tree actually matches the glob. Example:
event/YYYY-MM-DD-HH/data.orc, where there are many subdirectories of /event,
one for each hour. A glob event/2016-02-1*-17/*.orc would seem to be at risk of
collecting all of them and pulling back too much data. But in fact, pushing the
filter down can only improve today's scan by removing one generation of list
creation; more can be removed later. I think I could apply a variant of your
patch to my fork and get some speedup.
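To make the trade-off concrete, here's a minimal, hypothetical sketch (not the actual S3A code) of what matching a glob against one flat bulk listing looks like: the object keys stand in for the results of a listObjects call, and a simple glob-to-regex conversion does the filtering, so only the matching keys survive even when the listing is wide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: filter a flat object listing (as a bulk
 * listObjects call would return) against a glob, instead of
 * treewalking directory by directory.
 */
public class FlatGlobFilter {

  /** Convert a simple glob (only '*' and '?' supported here) to a regex. */
  static Pattern globToRegex(String glob) {
    StringBuilder re = new StringBuilder();
    for (char c : glob.toCharArray()) {
      switch (c) {
        case '*': re.append("[^/]*"); break;   // '*' never crosses a '/'
        case '?': re.append("[^/]"); break;
        default: re.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(re.toString());
  }

  /** Return only the keys from one flat listing that match the glob. */
  static List<String> matchKeys(List<String> keys, String glob) {
    Pattern p = globToRegex(glob);
    List<String> out = new ArrayList<>();
    for (String key : keys) {
      if (p.matcher(key).matches()) {
        out.add(key);
      }
    }
    return out;
  }
}
```

The point of the sketch: the cost of the wide listing is paid once, and the per-key filter is cheap, which is why pushing the filter down still beats a generation of per-directory LIST calls.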
Looking at the code though, the glob() command still returns an array, so it
has to calculate all the results up front. That kills things against very large
datasets; the one I use to demonstrate this is the Landsat set.
We could start with a new glob call which returns a
{{RemoteIterator<LocatedFileStatus>}}, and so allows clients to iterate over
the data incrementally. The original "legacy" call would be implemented in S3A
on top of this, as it would just iterate to build up the array. Now, an
S3A-only API call isn't going to be that ideal, but it can start us thinking
about what the right API to pull up to FileSystem is.
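The proposed shape can be sketched as below. This is illustrative, not the real S3A API: {{RemoteIterator}} here is a local stand-in mirroring org.apache.hadoop.fs.RemoteIterator's two methods, and String stands in for LocatedFileStatus to keep the example self-contained.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Sketch of the proposed API shape: an iterator-returning glob call,
 * with the legacy array-returning call implemented by draining it.
 */
public class IteratorGlob {

  /** Minimal stand-in for Hadoop's RemoteIterator interface. */
  interface RemoteIterator<T> {
    boolean hasNext() throws IOException;
    T next() throws IOException;
  }

  /** The new call: results could be produced lazily, page by page. */
  static RemoteIterator<String> globIterator(List<String> matches) {
    final Iterator<String> it = matches.iterator();
    return new RemoteIterator<String>() {
      public boolean hasNext() { return it.hasNext(); }
      public String next() { return it.next(); }
    };
  }

  /** The legacy call: build the full array by iterating the new one. */
  static String[] globStatus(List<String> matches) throws IOException {
    List<String> out = new ArrayList<>();
    RemoteIterator<String> it = globIterator(matches);
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out.toArray(new String[0]);
  }
}
```

Clients which only ever walk the results never pay for materialising the full array; only callers of the legacy signature do.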
Now, for real wildcard performance, I'd been wondering if we could be clever
about path depth and be able to configure the S3A filesystem to say "if there
is a pattern in the last 2 elements, do the bulk listing", or enable it for 3,
4, 5 elements, etc. That way you could optimise the setting for the layout of
your filesystem. The trouble there: you need to know the layout in advance and
configure it per filesystem, *unless we provide something like a filter which
implements more policy*.
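A hypothetical sketch of that depth heuristic, to make the idea concrete: count how deep the first wildcard sits, measured from the end of the path, and only switch to the bulk listing when it falls within a configured limit. The names here are illustrative, not real S3A options.

```java
/**
 * Hypothetical depth policy: bulk-list only when the glob's wildcards
 * are confined to the last N path elements, so the listing prefix
 * stays narrow.
 */
public class GlobDepthPolicy {

  /**
   * Depth, counted from the last path element, of the first wildcard;
   * 0 if the pattern contains no wildcard at all.
   */
  static int wildcardDepth(String pattern) {
    String[] elements = pattern.split("/");
    for (int i = 0; i < elements.length; i++) {
      if (elements[i].contains("*") || elements[i].contains("?")) {
        return elements.length - i;  // 2 => pattern in last 2 elements
      }
    }
    return 0;
  }

  /** True when the pattern is confined to the last maxDepth elements. */
  static boolean useBulkListing(String pattern, int maxDepth) {
    int depth = wildcardDepth(pattern);
    return depth > 0 && depth <= maxDepth;
  }
}
```

With a limit of 2, the event/2016-02-1*-17/*.orc example above would take the bulk-listing path, while a glob with a wildcard higher up the tree would fall back to the treewalk.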
Thoughts?
> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
> Key: HADOOP-13371
> URL: https://issues.apache.org/jira/browse/HADOOP-13371
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in
> {{FileSystem.listStatus}} calls, but doesn't do anything for
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one
> which does a selective recursive scan by pattern matching as it goes down,
> filtering out those patterns which don't match. Cost is
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the
> filtered treewalk, but through a list + filter operation. This would be an
> O(files) lookup *before any filtering took place*.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]