[
https://issues.apache.org/jira/browse/HADOOP-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410875#comment-15410875
]
Steve Loughran commented on HADOOP-13430:
-----------------------------------------
Looks good. We'll all need to test this, especially against Hive and Spark perf
runs.
One thing we don't have for the s3/object store tests is a good real-world
directory tree; our local tests only create small and unrealistic trees, so
there's a risk that code (mine especially) optimises for that test layout, not
real-world ones. That'll get even worse once we look at globStatus
optimisation, where the glob patterns for queries need to be realistic too.
Given you are clearly using this in production, is there a way you could share
some of the directory structure & query patterns with us? If we had a text
file which listed all the paths, we could have a (manually invoked) test case
which would read it and generate the directory tree, which could then be used
by all tests looking at metadata performance. We wouldn't need the contents of
files, or the real names, but knowing things like dates in the layout & file
extensions, along with any globStatus calls, would help make for realistic
operations.
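To make the idea concrete, here is a minimal sketch of such a tree generator. It is only an illustration under assumptions: the class name ListingTreeBuilder, the one-path-per-line listing format (a trailing "/" marking a directory, "#" starting a comment) and the use of java.nio.file against a local directory are all hypothetical; a real test would presumably go through the Hadoop FileSystem API against an object store instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: rebuilds an anonymised directory tree from a path listing. */
final class ListingTreeBuilder {

  private ListingTreeBuilder() {
  }

  /**
   * Creates one empty file per listed path under root. Entries ending in "/"
   * become directories. Blank lines and lines starting with '#' are skipped.
   *
   * @return the number of files created
   */
  static int build(List<String> listing, Path root) throws IOException {
    Path base = root.toAbsolutePath().normalize();
    int files = 0;
    for (String raw : listing) {
      String entry = raw.trim();
      if (entry.isEmpty() || entry.startsWith("#")) {
        continue;
      }
      boolean isDirectory = entry.endsWith("/");
      // Drop leading slashes so that every entry resolves underneath root.
      String relative = entry.replaceAll("^/+", "");
      Path target = base.resolve(relative).normalize();
      if (!target.startsWith(base)) {
        throw new IllegalArgumentException("Path escapes the test root: " + entry);
      }
      if (target.equals(base)) {
        continue; // the entry named the root itself
      }
      if (isDirectory) {
        Files.createDirectories(target);
      } else {
        Files.createDirectories(target.getParent());
        if (!Files.exists(target)) {
          Files.createFile(target); // contents are irrelevant for metadata tests
          files++;
        }
      }
    }
    return files;
  }
}
```

A caller could feed it Files.readAllLines(listingFile) and a scratch directory, then point the metadata tests at the result.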
> Optimize and fix getFileStatus in S3A
> -------------------------------------
>
> Key: HADOOP-13430
> URL: https://issues.apache.org/jira/browse/HADOOP-13430
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Steven K. Wong
> Priority: Minor
> Attachments: HADOOP-13430.001.WIP.patch
>
>
> Currently, S3AFileSystem.getFileStatus(Path f) sends up to 3 requests to S3
> when pathToKey(f) = key = "foo/bar" is a directory:
> 1. HEAD key=foo/bar [continue if not found]
> 2. HEAD key=foo/bar/ [continue if not found]
> 3. LIST prefix=foo/bar/ delimiter=/ max-keys=1
> My experience (and generally true, I reckon) is that almost all directories
> are nonempty directories without a "fake directory" file (e.g. "foo/bar/").
> Under this condition, request #2 is mostly unhelpful; it only slows down
> getFileStatus. Therefore, I propose swapping the order of requests #2 and #3.
> The swapped HEAD request will be skipped in practically all cases.
> Furthermore, when key = "foo/bar" is a nonempty directory that contains a
> "fake directory" file (in addition to actual files), getFileStatus currently
> returns an S3AFileStatus with isEmptyDirectory=true, which is wrong. Swapping
> will fix this. The swapped LIST request will use max-keys=2 to determine
> isEmptyDirectory correctly. (Removing the delimiter from the LIST request
> should make the logic a little simpler than otherwise.)
> Note that key = "foo/bar/" has the same problem with isEmptyDirectory. To fix
> it, I propose skipping request #1 when key ends with "/". The price is this
> will, for an empty directory, replace a HEAD request with a LIST request
> that's generally more taxing on S3.
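The probe order proposed in the description above can be sketched against an in-memory key set. This is not the S3AFileSystem code or the attached patch: ProbeOrderSketch, Kind and the request log are hypothetical names, the store is a plain sorted set rather than S3, and the root key is not handled. It only illustrates why listing with max-keys=2 before the second HEAD yields the right isEmptyDirectory answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory model of the proposed getFileStatus probe order. */
final class ProbeOrderSketch {

  enum Kind { FILE, EMPTY_DIRECTORY, NONEMPTY_DIRECTORY, NOT_FOUND }

  /** Requests issued by the most recent getFileStatus call. */
  final List<String> requests = new ArrayList<>();

  private final TreeSet<String> store = new TreeSet<>();

  ProbeOrderSketch(List<String> keys) {
    store.addAll(keys);
  }

  private boolean head(String key) {
    requests.add("HEAD " + key);
    return store.contains(key);
  }

  /** LIST without a delimiter: the first maxKeys keys that start with prefix. */
  private List<String> list(String prefix, int maxKeys) {
    requests.add("LIST prefix=" + prefix + " max-keys=" + maxKeys);
    List<String> result = new ArrayList<>();
    for (String key : store.tailSet(prefix, true)) {
      if (!key.startsWith(prefix) || result.size() == maxKeys) {
        break;
      }
      result.add(key);
    }
    return result;
  }

  Kind getFileStatus(String key) {
    requests.clear();
    boolean trailingSlash = key.endsWith("/");
    // Request #1 is skipped when the key already ends with "/".
    if (!trailingSlash && head(key)) {
      return Kind.FILE;
    }
    String dirKey = trailingSlash ? key : key + "/";
    // Two keys suffice: the marker sorts first, so a second key means a child.
    List<String> listed = list(dirKey, 2);
    for (String listedKey : listed) {
      if (!listedKey.equals(dirKey)) {
        return Kind.NONEMPTY_DIRECTORY;
      }
    }
    if (!listed.isEmpty()) {
      return Kind.EMPTY_DIRECTORY; // only the "fake directory" marker exists
    }
    // Fallback HEAD from the proposal. In this consistent in-memory model it
    // can never succeed after an empty LIST; it is kept to mirror the order.
    return head(dirKey) ? Kind.EMPTY_DIRECTORY : Kind.NOT_FOUND;
  }
}
```

With keys "foo/bar/" and "foo/bar/a.txt" in the store, getFileStatus("foo/bar") issues one HEAD and one LIST and reports a nonempty directory, which is the case the description says currently comes back with isEmptyDirectory=true.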