[ 
https://issues.apache.org/jira/browse/HADOOP-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410875#comment-15410875
 ] 

Steve Loughran commented on HADOOP-13430:
-----------------------------------------

Looks good. We'll all need to test this, especially against Hive and Spark perf 
runs.

One thing we don't have for the s3/object store tests is a good real-world 
directory tree; our local tests only create small and unrealistic tree views. 
There's a risk that code (mine especially) optimises for that test layout, not 
real-world ones. That'll get even worse once we look at globStatus 
optimisation, where the glob patterns for queries need to be realistic too.

Given you are clearly using this in production, is there a way you could share 
some of the directory structure & query patterns with us? If we had a text 
file which listed all the paths, we could have a (manually invoked) test case 
which would read this and generate the directory tree, which could then be used 
by all tests looking at metadata performance. We wouldn't need contents of 
files, or the real names, but knowing things like dates in the layout & file 
extensions, along with any globStatus calls, would help make for realistic 
operations.
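
A minimal sketch of such a tree generator, assuming a listing file with one 
path per line where a trailing "/" marks a directory. That file format, the 
class name TreeFromListing and the method build are illustrative assumptions, 
not existing Hadoop test code. It uses java.nio against a local directory so it 
runs standalone; a real test would presumably go through the Hadoop FileSystem 
API against the object store instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: rebuilds a directory tree from a plain-text path listing. */
final class TreeFromListing {

  /**
   * Creates one entry under root for each non-blank, non-comment line.
   * A line ending in "/" becomes a directory; any other line becomes an
   * empty file (file contents are not needed for metadata tests).
   * Returns the number of files created.
   */
  static int build(Path root, List<String> lines) throws IOException {
    Path base = root.toAbsolutePath().normalize();
    int files = 0;
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      // Strip leading slashes so every entry stays relative to the root.
      String rel = line.replaceAll("^/+", "");
      Path target = base.resolve(rel).normalize();
      if (!target.startsWith(base)) {
        throw new IOException("Path escapes root: " + line);
      }
      if (line.endsWith("/")) {
        Files.createDirectories(target);
      } else {
        Files.createDirectories(target.getParent());
        if (!Files.exists(target)) {
          Files.createFile(target);
          files++;
        }
      }
    }
    return files;
  }
}
```

For example, TreeFromListing.build(Files.createTempDirectory("tree"), 
Files.readAllLines(Paths.get("paths.txt"))) would recreate the layout locally, 
where paths.txt is a hypothetical listing file.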

> Optimize and fix getFileStatus in S3A
> -------------------------------------
>
>                 Key: HADOOP-13430
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13430
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steven K. Wong
>            Priority: Minor
>         Attachments: HADOOP-13430.001.WIP.patch
>
>
> Currently, S3AFileSystem.getFileStatus(Path f) sends up to 3 requests to S3 
> when pathToKey(f) = key = "foo/bar" is a directory:
> 1. HEAD key=foo/bar [continue if not found]
> 2. HEAD key=foo/bar/ [continue if not found]
> 3. LIST prefix=foo/bar/ delimiter=/ max-keys=1
> My experience (and generally true, I reckon) is that almost all directories 
> are nonempty directories without a "fake directory" file (e.g. "foo/bar/"). 
> Under this condition, request #2 is mostly unhelpful; it only slows down 
> getFileStatus. Therefore, I propose swapping the order of requests #2 and #3. 
> The swapped HEAD request will be skipped in practically all cases.
> Furthermore, when key = "foo/bar" is a nonempty directory that contains a 
> "fake directory" file (in addition to actual files), getFileStatus currently 
> returns an S3AFileStatus with isEmptyDirectory=true, which is wrong. Swapping 
> will fix this. The swapped LIST request will use max-keys=2 to determine 
> isEmptyDirectory correctly. (Removing the delimiter from the LIST request 
> should make the logic a little simpler than otherwise.)
> Note that key = "foo/bar/" has the same problem with isEmptyDirectory. To fix 
> it, I propose skipping request #1 when key ends with "/". The price is this 
> will, for an empty directory, replace a HEAD request with a LIST request 
> that's generally more taxing on S3.
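
The proposed request order can be sketched against a toy in-memory key set. 
This is only a model of the description above, not S3AFileSystem code: the 
class ProbeOrderSketch, its Kind enum and the probe method are hypothetical 
names, a sorted TreeSet stands in for the bucket, and the key is assumed to be 
non-empty (the root is not handled). The LIST step takes no delimiter and at 
most two keys, as proposed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the proposed probe order; a TreeSet of keys stands in for S3. */
final class ProbeOrderSketch {
  enum Kind { FILE, EMPTY_DIR, NONEMPTY_DIR, NOT_FOUND }

  static final class Result {
    final Kind kind;
    final List<String> requests; // requests issued, in order
    Result(Kind kind, List<String> requests) {
      this.kind = kind;
      this.requests = requests;
    }
  }

  static Result probe(TreeSet<String> store, String key) {
    List<String> reqs = new ArrayList<>();
    // Request 1: HEAD on the key itself, skipped when the key ends with "/".
    if (!key.endsWith("/")) {
      reqs.add("HEAD " + key);
      if (store.contains(key)) {
        return new Result(Kind.FILE, reqs);
      }
    }
    String dirKey = key.endsWith("/") ? key : key + "/";
    // Request 2 (formerly 3): LIST prefix=dirKey, no delimiter, max-keys=2.
    reqs.add("LIST " + dirKey);
    List<String> listed = new ArrayList<>();
    for (String k : store.tailSet(dirKey, true)) {
      if (!k.startsWith(dirKey) || listed.size() == 2) {
        break;
      }
      listed.add(k);
    }
    if (!listed.isEmpty()) {
      // Empty only when the single hit is the "fake directory" marker itself.
      boolean onlyMarker = listed.size() == 1 && listed.get(0).equals(dirKey);
      return new Result(onlyMarker ? Kind.EMPTY_DIR : Kind.NONEMPTY_DIR, reqs);
    }
    // Request 3 (formerly 2): HEAD on the marker, reached only if LIST found nothing.
    // In this in-memory model it can never hit, since LIST would have returned
    // the marker; it is kept to mirror the swapped order described above.
    reqs.add("HEAD " + dirKey);
    if (store.contains(dirKey)) {
      return new Result(Kind.EMPTY_DIR, reqs);
    }
    return new Result(Kind.NOT_FOUND, reqs);
  }
}
```

With keys "foo/bar/" and "foo/bar/a.txt" in the set, probe(store, "foo/bar") 
issues a HEAD and a LIST and reports a nonempty directory, which is the case 
the description says is currently reported as empty.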



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
