[ https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931314#comment-15931314 ]

Steve Loughran commented on HADOOP-13371:
-----------------------------------------

I've hit the Submit Patch button; that will get the patch applied against trunk 
and run through the automated tests.

However, we do have an extra policy for the object store tests, requiring a 
declaration of how the S3 test runs went: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md

Looking at the patch itself: it tries to do this by changing the FileSystem 
base classes. That's generally harder to do, as it runs a greater risk of 
breaking things; it also makes backporting less likely. 

# Hadoop trunk should be the base Hadoop version to work with; as long as you 
keep the use of Java 8 features to closures, stuff can be backported to branch-2. 
Things like HADOOP-13519 will conflict with the new Filter field of Path, and 
as I warned, that's not going to be allowed.

Where I think we can start is in S3a itself, with its own globber, which we can 
optimise purely for S3a.

I started that a while back; I'll push it up as a PR, even though it 
currently doesn't do anything useful except add a new scale test intended to 
look at glob performance.

Why did I stop working on it? I put it to one side as the complexity got too 
much for me.

What I wanted to do was the full HADOOP-13208 recursive list status, that is 
listFiles(path, true), picking up all the children which the glob pattern 
matches. The problem here is what to do about trees which get very wide very 
fast, but for which only a small part actually matches the glob. Example: 
event/YYYY-MM-DD-HH/data.orc, where there are many subdirectories of /event, one 
for each hour. A glob event/2016-02-1*-17/*.orc would seem to be at risk of 
collecting all of them and pulling back too much data. But actually, pushing 
your filter down can only improve today's scan by removing one generation of 
list creation; further optimisations can be added later. I think I could apply 
a variant of your patch to my fork and get some speedup.
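
To make the trade-off concrete, here's a rough sketch of what that list-and-filter 
glob looks like against the existing FileSystem API. This isn't the attached patch 
or anything in my fork; the bucket name and class are made up, and it leans on the 
existing {{org.apache.hadoop.fs.GlobPattern}} matcher:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.GlobPattern;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FlatGlobSketch {

  /**
   * Glob by doing a single recursive listFiles() under the unglobbed prefix,
   * then filtering client-side. One bulk listing replaces the per-directory
   * treewalk; the cost is listing every object under the prefix even when
   * only a few of them match.
   */
  public static void flatGlob(FileSystem fs, Path prefix, String glob)
      throws IOException {
    // note: GlobPattern's '*' can cross '/' boundaries, unlike the
    // per-component matching the real globber does
    GlobPattern pattern = new GlobPattern(glob);
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(prefix, true);
    while (files.hasNext()) {
      LocatedFileStatus status = files.next();
      if (pattern.matches(status.getPath().toUri().getPath())) {
        System.out.println(status.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path prefix = new Path("s3a://example-bucket/event");   // made-up bucket
    FileSystem fs = prefix.getFileSystem(new Configuration());
    flatGlob(fs, prefix, "/event/2016-02-1*-17/*.orc");
  }
}
{code}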

Looking at the code though, the glob() call still returns an array, so it has 
to calculate all the results up front. That kills performance against very large 
datasets; the one I use to demonstrate this is the Landsat set.

We could start with a new glob call which returns a 
{{RemoteIterator<LocatedFileStatus>}}, and so allows clients to iterate 
over the data. The original "legacy" call would be implemented in S3A on top of 
this: it would just iterate to build up the array. Now, an S3A-only API 
call isn't going to be that ideal, but it can start us thinking about what the 
right API to pull up to FileSystem is. 
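
Something like this, purely as a sketch of the API shape rather than real S3A 
code; the globFiles() name is hypothetical:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/**
 * Sketch of the proposed shape: a lazy, iterator-returning glob, with the
 * legacy array-returning call layered on top of it.
 */
public interface IteratorGlob {

  /** Lazily evaluated glob: results stream back as the listing proceeds. */
  RemoteIterator<LocatedFileStatus> globFiles(Path pathPattern)
      throws IOException;

  /** Legacy call: drain the iterator to build the full array up front. */
  default FileStatus[] globStatus(Path pathPattern) throws IOException {
    List<FileStatus> results = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = globFiles(pathPattern);
    while (it.hasNext()) {
      results.add(it.next());
    }
    return results.toArray(new FileStatus[0]);
  }
}
{code}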

Now, for real wildcard performance, I'd been wondering if we could be clever 
about path depth and let you configure the S3A filesystem to say "if there is a 
pattern only in the last 2 elements, do the bulk listing", or enable it for 3, 
4, 5 elements, etc. That way you could optimise the setting for the layout of 
your filesystem. Trouble there though: you need to know the layout in advance 
and configure it per filesystem, *unless we provide something like a filter 
which implements more policy*.
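
Here's a sketch of what that depth heuristic could look like; the 
fs.s3a.glob.bulk.list.depth key is invented for illustration and doesn't exist 
today:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the depth heuristic: only switch to the bulk listing when the
 * wildcards are confined to the last N path elements.
 */
public class GlobDepthPolicy {

  /** Hypothetical config key; not a real S3A option. */
  public static final String BULK_LIST_DEPTH_KEY = "fs.s3a.glob.bulk.list.depth";
  public static final int BULK_LIST_DEPTH_DEFAULT = 2;

  /** True if every wildcard sits within the last maxDepth path elements. */
  public static boolean useBulkListing(Configuration conf, Path pattern) {
    int maxDepth = conf.getInt(BULK_LIST_DEPTH_KEY, BULK_LIST_DEPTH_DEFAULT);
    String[] elements = pattern.toUri().getPath().split("/");
    for (int i = 0; i < elements.length; i++) {
      boolean wildcarded = elements[i].contains("*")
          || elements[i].contains("?")
          || elements[i].contains("{")
          || elements[i].contains("[");
      int depthFromTail = elements.length - i;  // 1 == last element
      if (wildcarded && depthFromTail > maxDepth) {
        return false;  // wildcard too far up the tree: keep the treewalk
      }
    }
    return true;
  }
}
{code}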

thoughts?




> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in 
> {{FileSystem.listStatus}} calls, but doesn't do anything for 
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one 
> which does a selective recursive scan by pattern matching as it goes down, 
> filtering out those paths which don't match. Cost is 
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the 
> filtered treewalk, but through a list + filter operation. This would be an 
> O(files) lookup *before any filtering took place*.


