[ 
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930952#comment-15930952
 ] 

ASF GitHub Bot commented on HADOOP-13371:
-----------------------------------------

GitHub user kazuyukitanimura opened a pull request:

    https://github.com/apache/hadoop/pull/203

    HADOOP-13371. S3A globber to use bulk listObject call over recursive
    directory scan

    Hi @steveloughran 
    
    This pull request fixes (or at least mitigates) the issue described in
    [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371).
    
    With this patch, the path filter is applied before globbing rather than after it.
    
    Previously I hit OOM errors when globbing large S3 buckets, because the
    globber kept every candidate path and filtering happened only at the end.
    This patch prunes unneeded paths with the filter first. I have applied it
    to our production pipelines, and they have been running without issues.
    This should be applicable to branch-2.8 as well.
    
    Thanks in advance for reviewing this.
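
To illustrate the filter-first idea described above, here is a minimal sketch. It is not the code from the pull request: the class and method names are hypothetical, plain strings stand in for paths, and a `java.util.function.Predicate` stands in for Hadoop's path filter. Both methods return the same matches; the difference is that the filter-last variant buffers every candidate before testing any of them, which is roughly the pattern that could exhaust memory on a very large bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not taken from the actual patch.
class FilterFirstSketch {

    /** Filter-last: buffers every candidate, then filters the buffered list. */
    public static List<String> filterLast(Iterator<String> candidates,
                                          Predicate<String> filter) {
        List<String> all = new ArrayList<>();
        while (candidates.hasNext()) {
            all.add(candidates.next());   // grows with the total number of candidates
        }
        List<String> kept = new ArrayList<>();
        for (String path : all) {
            if (filter.test(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    /** Filter-first: tests each candidate as it arrives and retains only matches. */
    public static List<String> filterFirst(Iterator<String> candidates,
                                           Predicate<String> filter) {
        List<String> kept = new ArrayList<>();
        while (candidates.hasNext()) {
            String path = candidates.next();
            if (filter.test(path)) {      // rejected paths are never stored
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "logs/2017/03/17/part-0000.gz",
            "logs/2017/03/17/_SUCCESS",
            "logs/2017/03/18/part-0000.gz");
        // Example filter (an assumption): skip files whose name starts with '_'.
        Predicate<String> notHidden =
            p -> !p.substring(p.lastIndexOf('/') + 1).startsWith("_");
        System.out.println(filterFirst(paths.iterator(), notHidden));
    }
}
```

With an iterator that produces candidates lazily, the filter-first variant only ever holds the matching paths, so its memory use scales with the result size rather than with the number of paths examined.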

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bloomreach/hadoop trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #203
    
----
commit 5d6b3e1ebb97cc11479db6c30b0a1a04986c4967
Author: kazu <[email protected]>
Date:   2017-03-18T00:24:41Z

    HADOOP-13371. S3A globber to use bulk listObject call over recursive
    directory scan

----


> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in 
> {{FileSystem.listStatus}} calls, but doesn't do anything for 
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one 
> which does a selective recursive scan by pattern matching as it goes down, 
> filtering out those patterns which don't match. Cost is 
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the 
> filtered treewalk, but through a list + filter operation. This would be an 
> O(files) lookup *before any filtering took place*.
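
The "list + filter" approach in the description above can be sketched as follows. This is a rough illustration under stated assumptions, not the S3A implementation: the flat list of keys stands in for the result of a bulk object listing under a prefix, the class and method names are hypothetical, and `java.nio`'s glob matcher is used as a stand-in for Hadoop's glob syntax (the two are similar but not identical). It also assumes Unix-style `/` separators.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the S3A globber.
class ListThenFilterSketch {

    /**
     * One pass over a flat key listing, keeping the keys that match the glob.
     * The work is proportional to the number of keys listed, i.e. O(files),
     * and no per-directory listing calls are needed.
     */
    public static List<String> globFlatListing(List<String> keys, String glob) {
        // java.nio glob: '*' matches within one path segment only.
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String key : keys) {
            if (matcher.matches(Paths.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the keys returned by a single bulk listing of "logs/".
        List<String> keys = Arrays.asList(
            "logs/2016/12/part-0000.gz",
            "logs/2017/03/part-0000.gz",
            "logs/2017/03/_SUCCESS",
            "logs/2017/04/part-0001.gz");
        System.out.println(globFlatListing(keys, "logs/2017/*/part-*.gz"));
    }
}
```

The trade-off is the one the description points out: every key under the prefix is listed before any filtering happens, so this pays off when a flat listing is cheaper than walking many directories one call at a time.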



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

