[
https://issues.apache.org/jira/browse/HADOOP-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395406#comment-15395406
]
Steve Loughran commented on HADOOP-13430:
-----------------------------------------
(I've just moved this under the S3A phase III JIRA: stuff for Hadoop 2.9)
Regarding the feature, yes, we need it. You can see from the metrics we're
collecting how expensive it is.
FWIW I did play with this, reordering the operations, but things didn't work,
so I didn't create a JIRA. That's "didn't work" in the sense of failed
assertions rather than performance problems, so probably a bug in my edit.
There are a couple of other optimisation points to consider too:
# sometimes S3A checks internally for directories (e.g. in mkdirs). It may be
able to use what it already knows about the tree above or below the path to
make better decisions, or at least to look for less information. Example: if
looking to see if there is a fake directory, there's no need to look for a
non-fake one.
# sometimes the getFileStatus call is followed immediately (if the path is a
directory) by a listStatus call. Examples: rename(), delete(). In these
situations, we ought to be able to ask for a bigger list in getFileStatus and
feed the result straight into the next stage of the work. We'd get a bigger
result back from that first list, but a whole list call could be eliminated.
But that strategy of dropping the delimiter is potentially dangerous; it
depends on which call is happening.
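A rough sketch of the second point, assuming a hypothetical in-memory key set in place of S3 (ListingReuseSketch, Page, probeDirectory and keysUnder are made-up names, not S3A code). The probe asks for a whole page with no delimiter, and the follow-up stage starts from that page instead of issuing its own first LIST. A flat listing like this only suits recursive operations such as delete() or rename().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a bucket; not the real S3A code. */
final class ListingReuseSketch {
    /** One page of a simulated LIST response. */
    static final class Page {
        final List<String> keys = new ArrayList<>();
        boolean truncated; // true if more matching keys remain after this page
    }

    private final TreeSet<String> store = new TreeSet<>();
    int listRequests = 0; // counts simulated LIST calls

    ListingReuseSketch(String... objectKeys) {
        for (String k : objectKeys) store.add(k);
    }

    // Simulated LIST without a delimiter: keys under prefix, after startAfter
    // (or from the prefix itself when startAfter is null), at most maxKeys.
    // Assumes maxKeys >= 1.
    Page list(String prefix, String startAfter, int maxKeys) {
        listRequests++;
        SortedSet<String> from = (startAfter == null)
                ? store.tailSet(prefix, true)
                : store.tailSet(startAfter, false);
        Page page = new Page();
        for (String k : from) {
            if (!k.startsWith(prefix)) break; // left the prefix range
            if (page.keys.size() == maxKeys) { page.truncated = true; break; }
            page.keys.add(k);
        }
        return page;
    }

    // Directory probe that fetches a full page rather than max-keys=1,
    // so the caller can hand the page to the next stage.
    Page probeDirectory(String dirKey, int pageSize) {
        return list(dirKey, null, pageSize);
    }

    // Collect every key under dirKey, starting from the page the probe
    // already fetched; further LIST calls happen only if it was truncated.
    List<String> keysUnder(String dirKey, Page firstPage, int pageSize) {
        List<String> all = new ArrayList<>(firstPage.keys);
        Page page = firstPage;
        while (page.truncated) {
            page = list(dirKey, all.get(all.size() - 1), pageSize);
            all.addAll(page.keys);
        }
        return all;
    }
}
```

In this model, a directory whose contents fit in one page costs a single LIST for probe plus enumeration; larger directories pay only for the extra pages.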
> Optimize and fix getFileStatus in S3A
> -------------------------------------
>
> Key: HADOOP-13430
> URL: https://issues.apache.org/jira/browse/HADOOP-13430
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Steven K. Wong
> Priority: Minor
>
> Currently, S3AFileSystem.getFileStatus(Path f) sends up to 3 requests to S3
> when pathToKey(f) = key = "foo/bar" is a directory:
> 1. HEAD key=foo/bar [continue if not found]
> 2. HEAD key=foo/bar/ [continue if not found]
> 3. LIST prefix=foo/bar/ delimiter=/ max-keys=1
> My experience (and generally true, I reckon) is that almost all directories
> are nonempty directories without a "fake directory" file (e.g. "foo/bar/").
> Under this condition, request #2 is mostly unhelpful; it only slows down
> getFileStatus. Therefore, I propose swapping the order of requests #2 and #3.
> Furthermore, when key = "foo/bar" is a nonempty directory that contains a
> "fake directory" file (in addition to actual files), getFileStatus currently
> returns an S3AFileStatus with isEmptyDirectory=true, which is wrong. Swapping
> will fix this. The swapped LIST request will use max-keys=2 to determine
> isEmptyDirectory correctly. The swapped HEAD request will be skipped if the
> directory is empty. (Removing the delimiter from the LIST request should make
> the logic a little simpler than otherwise.)
> Note that key = "foo/bar/" has the same problem with isEmptyDirectory. To fix
> it, I propose skipping request #1 when key ends with "/". The price is this
> will, for an empty directory, replace a HEAD request with a LIST request
> that's generally more taxing on S3.
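One possible reading of the reordered probes in the description above, sketched against a hypothetical in-memory key set rather than the real S3 client (ProbeOrderSketch, Kind, head and list are illustrative names only). It assumes a delimiter-less LIST returns the "fake directory" key itself; under that assumption the model never needs the HEAD on "foo/bar/", which goes a step further than the description's wording and may not match the eventual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a bucket; not the real S3A code. */
final class ProbeOrderSketch {
    enum Kind { FILE, EMPTY_DIR, NONEMPTY_DIR, NOT_FOUND }

    private final TreeSet<String> keys = new TreeSet<>();
    int requests = 0; // counts simulated HEAD and LIST calls

    ProbeOrderSketch(String... objectKeys) {
        for (String k : objectKeys) keys.add(k);
    }

    // Simulated HEAD: true if an object with exactly this key exists.
    private boolean head(String key) {
        requests++;
        return keys.contains(key);
    }

    // Simulated LIST with a prefix, no delimiter, and a max-keys cap.
    private List<String> list(String prefix, int maxKeys) {
        requests++;
        List<String> out = new ArrayList<>();
        for (String k : keys.tailSet(prefix, true)) {
            if (!k.startsWith(prefix) || out.size() == maxKeys) break;
            out.add(k);
        }
        return out;
    }

    Kind getFileStatus(String key) {
        // Skip the plain HEAD when the key already ends with "/".
        if (!key.endsWith("/") && head(key)) return Kind.FILE;
        String dirKey = key.endsWith("/") ? key : key + "/";
        // max-keys=2: a marker plus one real child can be told apart
        // from a lone marker.
        List<String> listed = list(dirKey, 2);
        if (listed.isEmpty()) return Kind.NOT_FOUND;
        for (String k : listed) {
            if (!k.equals(dirKey)) return Kind.NONEMPTY_DIR;
        }
        return Kind.EMPTY_DIR; // only the marker object was listed
    }
}
```

In this model, a directory holding both a marker and a real file is reported as NONEMPTY_DIR after two requests (HEAD, then LIST), and a path that ends with "/" is resolved with a single LIST.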
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)