[jira] [Commented] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

Chris Nauroth (JIRA) Mon, 01 Aug 2016 15:37:43 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402954#comment-15402954 ]


Chris Nauroth commented on HADOOP-13208:
----------------------------------------

{quote}
What is this for? I see it was introduced in HADOOP-13131, but don't see
any usage of the configuration flag it sets (fs.s3a.impl.disable.cache).
{quote}

[~fabbri], sorry, I missed this comment last time around.

This setting disables the {{FileSystem}} cache.  The caching logic itself lives 
in the {{FileSystem}} base class.  (See the static {{FileSystem#get(URI, 
Configuration)}} method.)  Whether the cache is in effect is controlled by a 
configuration property whose name is built from the URI scheme: 
{{fs.<scheme>.impl.disable.cache}}.

In some cases, we might have a test suite with multiple test methods that each 
want to test a slightly different configuration.  However, all tests in a suite 
execute in the same process, so there is a risk of reusing a cached instance 
across multiple tests.  Unfortunately, the cache does not take the contents of 
the {{Configuration}} instance into account.  (The cache key is the scheme, 
authority, and {{UserGroupInformation}}.)  A later test could therefore get an 
instance that was built from an earlier test's configuration.  To work around 
that, we disable the cache for these tests.
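
To illustrate the behaviour described above, here is a minimal, self-contained sketch in plain Java. It is a simplified model, not Hadoop's actual code: {{FsCacheSketch}}, {{FakeFs}}, {{Key}}, the plain {{Map}} standing in for {{Configuration}}, the user string standing in for {{UserGroupInformation}}, and the {{test.option}} property are all hypothetical names chosen for the example. Only the {{fs.<scheme>.impl.disable.cache}} property name and the shape of the cache key come from the discussion.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Simplified model of a per-process filesystem cache; not Hadoop's implementation. */
class FsCacheSketch {

    /** Stand-in for a filesystem instance; remembers the configuration it was built from. */
    static final class FakeFs {
        final Map<String, String> conf;
        FakeFs(Map<String, String> conf) { this.conf = new HashMap<>(conf); }
    }

    /** Cache key: scheme, authority, and user. The configuration is NOT part of the key. */
    static final class Key {
        final String scheme, authority, user;
        Key(URI uri, String user) {
            this.scheme = uri.getScheme();
            this.authority = uri.getAuthority();
            this.user = user;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return Objects.equals(scheme, k.scheme)
                && Objects.equals(authority, k.authority)
                && Objects.equals(user, k.user);
        }
        @Override public int hashCode() { return Objects.hash(scheme, authority, user); }
    }

    private static final Map<Key, FakeFs> CACHE = new HashMap<>();

    /** Returns a cached instance unless fs.<scheme>.impl.disable.cache is true in conf. */
    static synchronized FakeFs get(URI uri, Map<String, String> conf, String user) {
        String flag = "fs." + uri.getScheme() + ".impl.disable.cache";
        if (Boolean.parseBoolean(conf.getOrDefault(flag, "false"))) {
            return new FakeFs(conf);               // bypass the cache entirely
        }
        return CACHE.computeIfAbsent(new Key(uri, user), k -> new FakeFs(conf));
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3a://bucket/data");
        Map<String, String> first = new HashMap<>();
        first.put("test.option", "A");             // hypothetical option
        Map<String, String> second = new HashMap<>();
        second.put("test.option", "B");

        FakeFs a = get(uri, first, "alice");
        FakeFs b = get(uri, second, "alice");      // same key, so the cached instance comes back
        System.out.println("cached instance reused: " + (a == b));
        System.out.println("option seen by second caller: " + b.conf.get("test.option"));

        second.put("fs.s3a.impl.disable.cache", "true");
        FakeFs c = get(uri, second, "alice");      // cache disabled, so a fresh instance is built
        System.out.println("fresh instance: " + (c != a));
        System.out.println("option seen with cache disabled: " + c.conf.get("test.option"));
    }
}
```

In this model the second caller silently receives an instance configured with the first caller's settings, which is the failure mode the tests avoid by setting the disable flag.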

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13208
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out to be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself.
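
The idea in the issue description, a flat prefix listing paged lazily from inside an iterator, can be sketched in self-contained Java as follows. This is an illustration under stated assumptions, not the S3A implementation: {{BulkListingSketch}}, {{FakeStore}}, {{listPage}}, and {{RecursiveListing}} are hypothetical names, the store is an in-memory sorted set, and paging by "last key seen" is a modelling choice standing in for whatever paging mechanism the real object store offers. The page size of 1000 comes from the description.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

class BulkListingSketch {

    /** In-memory stand-in for an object store that lists keys by prefix, one page at a time. */
    static final class FakeStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int listCalls = 0;                       // counts simulated List requests

        void put(String key) { keys.add(key); }

        /** Up to pageSize keys under prefix, sorted, strictly after startAfter (null = from the start). */
        List<String> listPage(String prefix, String startAfter, int pageSize) {
            listCalls++;
            Iterable<String> candidates = (startAfter == null)
                ? keys.tailSet(prefix, true)
                : keys.tailSet(startAfter, false);
            List<String> page = new ArrayList<>();
            for (String k : candidates) {
                // Keys sharing a prefix are contiguous in sorted order, so stop at the first miss.
                if (!k.startsWith(prefix) || page.size() == pageSize) break;
                page.add(k);
            }
            return page;
        }
    }

    /** Iterates over every key under a prefix, fetching the next page only when needed. */
    static final class RecursiveListing implements Iterator<String> {
        private final FakeStore store;
        private final String prefix;
        private final int pageSize;
        private List<String> page = new ArrayList<>();
        private int pos = 0;
        private String lastKey = null;
        private boolean exhausted = false;

        RecursiveListing(FakeStore store, String prefix, int pageSize) {
            this.store = store;
            this.prefix = prefix;
            this.pageSize = pageSize;
        }

        @Override public boolean hasNext() {
            if (pos >= page.size() && !exhausted) {
                page = store.listPage(prefix, lastKey, pageSize);
                pos = 0;
                if (page.size() < pageSize) exhausted = true;   // a short page means no more keys
                if (!page.isEmpty()) lastKey = page.get(page.size() - 1);
            }
            return pos < page.size();
        }

        @Override public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return page.get(pos++);
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        for (int i = 0; i < 2500; i++) {
            store.put("data/dir" + (i % 50) + "/file" + i);    // 50 pseudo-directories
        }
        store.put("other/file");
        Iterator<String> it = new RecursiveListing(store, "data/", 1000);
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count + " files listed with " + store.listCalls + " list calls");
    }
}
```

In this model the number of list requests depends on the number of descendants rather than on the number of pseudo-directories they sit in, and no request is issued until the caller starts iterating.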


