[ 
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943051#comment-15943051
 ] 

ASF GitHub Bot commented on HADOOP-13371:
-----------------------------------------

Github user steveloughran commented on the issue:

    https://github.com/apache/hadoop/pull/204
  
    What an s3a globber can do is skip the recursive treewalk of 
pseudo-directories that it does while matching the pattern. Instead it could 
ask for everything under a path, and only then do the filtering. Look at 
HADOOP-13208 for the details.
    
    This patch is the preamble: a copy of the existing globber, and the initial 
performance test which would be the baseline for measuring speedups. Look at 
some of the other dir operations to see how we measure that: not in elapsed 
time, but in the number of HTTP requests made.
    
    I should warn that this is not trivial to do well. I put the work aside 
once I came up with some example directory layouts which would overload a 
naive "ask for all then filter" scheme: any deep & wide tree where the 
wildcard was near the top of the tree and very selective would end up asking 
for way too much data, and discarding it all. Listing is complex.
    
    For now we're doing S3Guard, which switches to DynamoDB for consistency as 
well as performance. If you do want to improve client performance, this is 
somewhere we would all benefit from you getting involved. I don't see any 
of us going near other bits of speedup until after S3Guard is in, because 
everything will be stamping on the same lines of code and making merging a 
pain. And S3Guard is probably going to obsolete a lot of the speedup proposed 
here on any bucket which has the DynamoDB backing tables.
    
    



> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in 
> {{FileSystem.listStatus}} calls, but doesn't do anything for 
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one 
> which does a selective recursive scan by pattern matching as it goes down, 
> filtering out those patterns which don't match. Cost is 
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the 
> filtered treewalk, but through a list + filter operation. This would be an 
> O(files) lookup *before any filtering took place*.
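
The two strategies described above can be compared with a small simulation. 
This is a minimal Python sketch against an in-memory stand-in for the object 
store, not the S3A or Hadoop Globber code: `FakeStore`, `glob_treewalk` and 
`glob_list_then_filter` are hypothetical names, only files are returned, and 
every path component (wildcard or not) costs one listing per directory visited.

```python
import fnmatch


class FakeStore:
    """In-memory stand-in for an object store that counts LIST requests."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.list_requests = 0

    def list_children(self, prefix):
        """Delimiter-style listing: immediate children of a directory prefix."""
        self.list_requests += 1
        children = set()
        for key in self.keys:
            if key.startswith(prefix):
                head, sep, _ = key[len(prefix):].partition("/")
                children.add(head + sep)  # trailing "/" marks a pseudo-directory
        return sorted(children)

    def list_all(self, prefix, page_size=1000):
        """Flat listing of every key under prefix, one request per page."""
        matches = [k for k in self.keys if k.startswith(prefix)]
        self.list_requests += max(1, -(-len(matches) // page_size))
        return matches


def glob_treewalk(store, pattern):
    """Descend one path component at a time, pruning non-matching children."""
    components = pattern.split("/")
    candidates = [""]
    for i, comp in enumerate(components):
        last = i == len(components) - 1
        survivors = []
        for prefix in candidates:
            for child in store.list_children(prefix):
                is_dir = child.endswith("/")
                if fnmatch.fnmatchcase(child.rstrip("/"), comp) and (last or is_dir):
                    survivors.append(prefix + child)
        candidates = survivors
    return sorted(c for c in candidates if not c.endswith("/"))


def glob_list_then_filter(store, pattern):
    """Fetch every key in bulk, then match component by component client-side."""
    parts = pattern.split("/")
    result = []
    for key in store.list_all(""):
        key_parts = key.split("/")
        if len(key_parts) == len(parts) and all(
            fnmatch.fnmatchcase(k, p) for k, p in zip(key_parts, parts)
        ):
            result.append(key)
    return sorted(result)


if __name__ == "__main__":
    keys = [
        f"year={y}/month={m:02d}/part-{p}.csv"
        for y in (2015, 2016, 2017)
        for m in range(1, 13)
        for p in range(3)
    ]
    pattern = "year=2016/month=0*/part-*.csv"
    walk, bulk = FakeStore(keys), FakeStore(keys)
    assert glob_treewalk(walk, pattern) == glob_list_then_filter(bulk, pattern)
    print("treewalk requests:", walk.list_requests)
    print("bulk requests:", bulk.list_requests)
```

On this small layout both functions return the same files. The walk issues one 
request per directory it visits, while the bulk version issues one request per 
page but downloads all 108 keys before filtering any of them.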



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
