[ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:
------------------------------------
    Attachment: HADOOP-13208-branch-2-001.patch

Patch 001

incorporates HADOOP-13207 
* defines what the various list* commands do
* adds tests to the get file status contract test suite, which are 
automatically picked up by all implementation subclasses.
* adds some root dir listing tests to {{AbstractContractRootDirectoryTest}}, 
not because they manipulate the root, but to guarantee that test runs which 
parallelize the tests (e.g. hadoop-aws) run the root listing tests in a suite 
which already has to be serialized.

And in S3A

* pulls out the {{listObject}} request construction
* adds a {{RemoteIterator}}, {{ObjectListingIterator}}, to go through the 
results of a listing
* adds the {{RemoteIterator}} implementations {{FileStatusListingIterator}} and 
{{LocatedFileStatusIterator}}, which take an {{ObjectListingIterator}} and do 
the filtering and file status creation as they are iterated over.
* changes the listing operations to use these iterators, where appropriate.
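
To make the layering concrete, here is a minimal, self-contained sketch of the same idea: one iterator pages through listing results lazily, and a second one wraps it to filter entries as they are consumed. It is not the code in the patch. {{FakeStore}}, {{PageIterator}} and {{FilteringIterator}} are hypothetical names, the local {{RemoteIterator}} interface is only a stand-in with the same shape as Hadoop's, and plain strings replace object summaries and file statuses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class ListingSketch {

  /** Local stand-in with the same shape as Hadoop's RemoteIterator. */
  interface RemoteIterator<T> {
    boolean hasNext() throws IOException;
    T next() throws IOException;
  }

  /** Hypothetical in-memory store that returns keys in pages and counts requests. */
  static class FakeStore {
    final List<String> keys;
    final int pageSize;
    int requests = 0;

    FakeStore(List<String> keys, int pageSize) {
      this.keys = keys;
      this.pageSize = pageSize;
    }

    List<String> listPage(int offset) {
      requests++;
      return keys.subList(offset, Math.min(offset + pageSize, keys.size()));
    }
  }

  /** Fetches one page per next() call; nothing is requested until asked for. */
  static class PageIterator implements RemoteIterator<List<String>> {
    private final FakeStore store;
    private int offset = 0;
    private boolean first = true;

    PageIterator(FakeStore store) {
      this.store = store;
    }

    public boolean hasNext() {
      // The first request is always issued, even for an empty listing.
      return first || offset < store.keys.size();
    }

    public List<String> next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      first = false;
      List<String> page = store.listPage(offset);
      offset += page.size();
      return page;
    }
  }

  /** Flattens pages and drops rejected entries while being iterated over. */
  static class FilteringIterator implements RemoteIterator<String> {
    private final RemoteIterator<List<String>> pages;
    private final Predicate<String> accept;
    private List<String> current = new ArrayList<>();
    private int index = 0;
    private String lookahead;

    FilteringIterator(RemoteIterator<List<String>> pages, Predicate<String> accept) {
      this.pages = pages;
      this.accept = accept;
    }

    public boolean hasNext() throws IOException {
      while (lookahead == null) {
        if (index < current.size()) {
          String key = current.get(index++);
          if (accept.test(key)) {
            lookahead = key;
          }
        } else if (pages.hasNext()) {
          current = pages.next();
          index = 0;
        } else {
          return false;
        }
      }
      return true;
    }

    public String next() throws IOException {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      String result = lookahead;
      lookahead = null;
      return result;
    }
  }

  /** Drains the iterator chain, discarding directory-like keys ending in "/". */
  static List<String> listFiles(FakeStore store) {
    RemoteIterator<String> it =
        new FilteringIterator(new PageIterator(store), key -> !key.endsWith("/"));
    List<String> files = new ArrayList<>();
    try {
      while (it.hasNext()) {
        files.add(it.next());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return files;
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore(Arrays.asList("a/", "a/1", "a/b/", "a/b/2", "c"), 2);
    System.out.println(listFiles(store) + " in " + store.requests + " requests");
  }
}
```

In this toy run, five keys with a page size of two need three page requests, and the two directory-like keys are simply discarded by the filter; the directory depth never adds a request.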

The result is that we have a fairly functional programming-ish model of working 
across the listed entities, and can do what is essentially O(1) listing of 
directory trees with respect to the number of directories. Specifically, it's 
{{1 + O(files/max-keys)}}: the number of child directories is not a factor 
except in returning some extra paths to be discarded.
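
As a rough illustration of that difference, the sketch below compares request counts under a deliberately simplified model. The tree-walk figure uses the four requests per directory (two HEADs plus two lists) given in the issue description and assumes every directory fits in a single page; the bulk figure is one request per page of {{max-keys}} entries, with at least one request. The class and method names are hypothetical, and the numbers are estimates, not measurements.

```java
class ListingCost {

  /** Tree walk: 2 HEADs + 2 LISTs per directory, assuming one page per directory. */
  static long treeWalkRequests(long directories) {
    return 4 * directories;
  }

  /** Bulk listing: one request per page of maxKeys entries, never fewer than one. */
  static long bulkListRequests(long entries, int maxKeys) {
    if (maxKeys <= 0) {
      throw new IllegalArgumentException("maxKeys must be positive");
    }
    long pages = (entries + maxKeys - 1) / maxKeys; // ceiling division
    return Math.max(1, pages);
  }

  public static void main(String[] args) {
    long directories = 1_000;
    long entries = 100_000;
    int maxKeys = 1_000;
    System.out.println("tree walk: " + treeWalkRequests(directories) + " requests");
    System.out.println("bulk list: " + bulkListRequests(entries, maxKeys) + " requests");
  }
}
```

With 1,000 directories holding 100,000 entries and a page size of 1,000, this model gives 4,000 requests for the tree walk against 100 for the bulk listing.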

Testing: S3 Frankfurt; also tested Azure, OpenStack, localfs, HDFS


> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13208
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13208-branch-2-001.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out to be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
