[ https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760241#comment-13760241 ]
Jason Lowe commented on HADOOP-9912:
------------------------------------
Thanks for the behavior matrix, Colin. I think the issue of
compatible/incompatible is about *expectations* of the FileSystem listStatus
API. FileSystem hasn't supported symlinks until very recently, and as a result
I doubt many, if any, symlinks were being used in HDFS. It required custom
Java code to manipulate them and nothing written with FileSystem would work
with them.
I am under the impression that we want symlinks to "just work" for the majority
of existing applications. If that's the case then we need to avoid exposing
raw symlinks as results from the existing FileSystem APIs as callers aren't
expecting to deal with them. A directory walker is the classic case of this:
it expects isDir() to tell it when to traverse subdirectories, and a symlink
to a directory breaks that assumption.
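To make the directory-walker case concrete, here is a minimal sketch. It uses a hypothetical in-memory model (the Status class, the Lister interface and the /data tree are all invented for illustration and are not Hadoop's FileStatus or FileSystem), and it assumes every symlink in the model points at a directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model, NOT Hadoop's FileStatus/FileSystem.
final class WalkerSketch {
    // Stand-in for a file status: a path plus the two flags a caller might check.
    static final class Status {
        final String path;
        final boolean dir;
        final boolean symlink;
        Status(String path, boolean dir, boolean symlink) {
            this.path = path;
            this.dir = dir;
            this.symlink = symlink;
        }
        boolean isDir() { return dir; }
        boolean isSymlink() { return symlink; }
    }

    // Maps a directory path to the statuses of its children.
    interface Lister {
        List<Status> list(String dir);
    }

    // Walker written before symlinks existed: recurse when isDir() is true,
    // otherwise treat the entry as a plain file.
    static List<String> walk(Lister lister, String dir) {
        List<String> files = new ArrayList<>();
        for (Status s : lister.list(dir)) {
            if (s.isDir()) {
                files.addAll(walk(lister, s.path));
            } else {
                files.add(s.path);
            }
        }
        return files;
    }

    // /data/real is a directory holding part-0; /data/link is a symlink to it.
    // Listing through the link path is modeled as following the link.
    static final Map<String, List<Status>> TREE = new HashMap<>();
    static {
        TREE.put("/data", Arrays.asList(
            new Status("/data/real", true, false),
            new Status("/data/link", false, true)));
        TREE.put("/data/real", Arrays.asList(
            new Status("/data/real/part-0", false, false)));
        TREE.put("/data/link", Arrays.asList(
            new Status("/data/link/part-0", false, false)));
    }

    // Exposes the raw symlink entry (isDir() == false for the link).
    static List<Status> listRaw(String dir) {
        List<Status> children = TREE.get(dir);
        return children == null ? Collections.<Status>emptyList() : children;
    }

    // Reports each symlink as its target; in this model that is always a directory.
    static List<Status> listResolved(String dir) {
        List<Status> out = new ArrayList<>();
        for (Status s : listRaw(dir)) {
            out.add(s.isSymlink() ? new Status(s.path, true, false) : s);
        }
        return out;
    }
}
```

With WalkerSketch::listRaw the walker returns /data/link as if it were a file and never sees /data/link/part-0; with WalkerSketch::listResolved the same unmodified walker descends through the link.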
A proposal to keep the existing FileSystem users working with symlinks in HDFS:
- listStatus resolves symlinks when possible. If the symlink cannot be
resolved (e.g., a dangling link or a permission-restricted target path), it
returns the status of the symlink itself, since it cannot stat the target.
- A separate API, either an overload of listStatus with an extra flag to
control symlink resolution or a separate listLinkStatus, can be used for
callers that always want the symlink status and not the status of the symlink
target. I would not expect the majority of existing listStatus callers to want
to see symlinks and have to resolve them. This is akin to the
getFileStatus/getFileLinkStatus pairing. Existing callers of getFileStatus
never expected symlinks, which is why it always follows them; a new API was
added to examine the symlink itself, rather than a new status API that
always follows the symlink.
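The two bullets above can be sketched with java.nio.file as a stand-in, since it already offers both a link-following stat and a NOFOLLOW_LINKS stat. This is only an analogy under that assumption: the class and method names below are hypothetical, chosen to mirror the proposal, and nothing here is Hadoop's FileSystem API or its eventual implementation.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical analogue of the proposal using the local file system.
final class ListStatusSketch {
    // Default behavior: readdir plus a stat that follows symlinks.
    static Map<String, BasicFileAttributes> listStatus(Path dir) throws IOException {
        return list(dir, true);
    }

    // Separate entry point for callers that want the link's own status.
    static Map<String, BasicFileAttributes> listLinkStatus(Path dir) throws IOException {
        return list(dir, false);
    }

    // The "extra flag" variant: resolveLinks controls symlink resolution.
    static Map<String, BasicFileAttributes> list(Path dir, boolean resolveLinks)
            throws IOException {
        Map<String, BasicFileAttributes> out = new TreeMap<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                out.put(child.getFileName().toString(), stat(child, resolveLinks));
            }
        }
        return out;
    }

    private static BasicFileAttributes stat(Path p, boolean resolveLinks) throws IOException {
        if (resolveLinks) {
            try {
                // Follows symlinks, so a link to a directory reports isDirectory().
                return Files.readAttributes(p, BasicFileAttributes.class);
            } catch (IOException unresolved) {
                // Dangling or unreadable target: fall through to the link's own status.
            }
        }
        return Files.readAttributes(p, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
    }
}
```

In this sketch a caller of listStatus only ever sees isSymbolicLink() return true for entries whose target could not be read, which matches the fallback described in the first bullet.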
For me it's all about what callers are expecting FileSystem's listStatus
semantics to be. I believe that existing callers are *not* expecting symlinks
to be returned since FileSystem never supported them in the past and I doubt
they were being used in HDFS in general. Most callers are expecting listStatus
to be a readdir and stat, and stat follows symlinks. If listStatus does not
resolve symlinks then it breaks existing Pig and MapReduce code, and I believe
that's an indication it will break a lot more code out there. The code that
breaks can be updated to understand symlinks, but I believe in practice that
means symlinks to directories will be fragile for a long time. Each tool that
encounters them will have to be updated to check for them and behave
accordingly.
> globStatus of a symlink to a directory does not report symlink as a directory
> -----------------------------------------------------------------------------
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt,
> old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the
> resulting FileStatus as a directory but recently this has changed.