[
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780306#comment-13780306
]
Daryn Sharp commented on HADOOP-9984:
-------------------------------------
I haven't had a chance to review the patch, but I came here to make another
point which you serendipitously touched on.
bq. I added a new API, listLinkStatus, which is like listStatus, but does not
resolve symlinks. listLinkStatus is necessary here, since _globStatus needs to
glob on file name, not target name_
Both {{listStatus}} and {{listLinkStatus}} must return file stats with the
exact same paths. Those paths must be constructed from the "unresolved" paths.
+The only difference is whether the stat is for the symlink itself, or the
stat of the resolved path.+
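To make that contract concrete, here is a minimal sketch using a hypothetical in-memory model. It is not the Hadoop API and not the code in the patch; {{ListingContractSketch}}, {{Stat}}, and the cluster names are invented for illustration, and link targets are assumed to be plain files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model, NOT the Hadoop API. It only illustrates the
// contract: both listings report the same unresolved paths.
final class ListingContractSketch {
    // One listing entry: the path handed back to the caller, plus whether
    // the stat describes a symlink.
    static final class Stat {
        final String path;
        final boolean isSymlink;

        Stat(String path, boolean isSymlink) {
            this.path = path;
            this.isSymlink = isSymlink;
        }
    }

    // Child name -> symlink target, or null when the child is a plain file.
    private final Map<String, String> children = new TreeMap<>();

    void addFile(String name) { children.put(name, null); }

    void addSymlink(String name, String target) { children.put(name, target); }

    // Models the proposed listLinkStatus: stat the link itself.
    List<Stat> listLinkStatus(String dir) {
        List<Stat> out = new ArrayList<>();
        for (Map.Entry<String, String> e : children.entrySet()) {
            out.add(new Stat(dir + "/" + e.getKey(), e.getValue() != null));
        }
        return out;
    }

    // Models listStatus: stat the link target (assumed here to be a plain
    // file), but still report the path built from the unresolved name.
    List<Stat> listStatus(String dir) {
        List<Stat> out = new ArrayList<>();
        for (String name : children.keySet()) {
            out.add(new Stat(dir + "/" + name, false));
        }
        return out;
    }
}
```

In this model the two listings differ only in the {{isSymlink}} flag; the paths are identical entry for entry.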
I also want to note that I've come to the conclusion that cross-namespace
symlinks, at least for FileSystem, are a terrible and dangerous idea.
*Performance*
Returning unresolved paths is transparent and compatible for users, but every
op will bang on every namespace it hops through.
*User and upper stack component code will break*
The frequent pattern of usage is obtaining the fs for a path, then assuming
that all paths derived from list/glob stats can be used with the original fs.
Upper layer stack components already break enough when using multiple paths
that aren't on the same fs, which has completely stalled federation here. They
assume all paths use the same fs as the first path. Cross-namespace symlinks
aggravate those bugs further.
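The same-fs assumption can be sketched with plain {{java.net.URI}} values, without any Hadoop classes. The helper name {{sameFileSystem}} and the cluster names are hypothetical; the check only compares scheme and authority, which is a simplification of how a filesystem is actually selected for a path.

```java
import java.net.URI;
import java.util.Objects;

// Illustrative only: models "can I reuse the fs I obtained for the root path?"
final class SameFsAssumptionSketch {
    // True when 'derived' has the same scheme and authority as 'root', i.e.
    // when code that reuses the root's filesystem would still be correct.
    static boolean sameFileSystem(URI root, URI derived) {
        return Objects.equals(root.getScheme(), derived.getScheme())
            && Objects.equals(root.getAuthority(), derived.getAuthority());
    }

    public static void main(String[] args) {
        URI root = URI.create("hdfs://cluster1/user/in");
        // Path as listed when the link is left unresolved.
        URI unresolved = URI.create("hdfs://cluster1/user/in/link");
        // Path as listed if a cross-namespace link were resolved.
        URI resolved = URI.create("hdfs://cluster2/data/part-0");

        System.out.println(sameFileSystem(root, unresolved)); // true
        System.out.println(sameFileSystem(root, resolved));   // false
    }
}
```

Code that never performs a check like this, and simply reuses the first fs, is the code that breaks once a listing can contain paths from another namespace.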
*Silent Data Loss*
Let's say I generate a file listing and feed that into another app. That app
filters for paths that start with a particular prefix. Returning a path with a
resolved link path, not including the original prefix, will cause the filter to
miss the path which will result in dropped input data.
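A minimal sketch of that failure mode, using plain strings rather than Hadoop types. The class name, the paths, and the prefix are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: a downstream app that filters a file listing by prefix.
final class PrefixFilterSketch {
    // Keeps only the paths that start with the given prefix.
    static List<String> keepUnderPrefix(List<String> listing, String prefix) {
        return listing.stream()
                .filter(p -> p.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String prefix = "hdfs://cluster1/in/";

        // Listing built from unresolved paths: the link keeps its prefix.
        List<String> unresolved = Arrays.asList(
                "hdfs://cluster1/in/part-0",
                "hdfs://cluster1/in/link");

        // Same directory, but the link is reported by its resolved target.
        List<String> resolved = Arrays.asList(
                "hdfs://cluster1/in/part-0",
                "hdfs://cluster2/data/part-1");

        System.out.println(keepUnderPrefix(unresolved, prefix).size()); // 2
        System.out.println(keepUnderPrefix(resolved, prefix).size());   // 1
    }
}
```

Nothing fails in the second case; the input behind the link is simply never processed.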
*Security Attacks*
It opens a whole new level of security issues involving symlink attacks. Let's
say a privileged server blocks access to certain schemes like "file" but
happily accepts "hdfs" paths. As a devious user, I now create a symlink in
hdfs back to the local filesystem. I use this link to steal your keytab or maybe
scribble over your config.
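As a rough sketch of why a scheme check on the submitted path is not enough: the model below keeps a hypothetical symlink table in memory and is not how HDFS stores or resolves links. {{SchemeCheckSketch}}, the hop limit of 32, and the example paths are assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a server that allows "hdfs" paths and blocks "file".
final class SchemeCheckSketch {
    // Hypothetical symlink table: link path -> target path.
    private final Map<URI, URI> symlinks = new HashMap<>();

    void addSymlink(URI link, URI target) { symlinks.put(link, target); }

    // Follows links until a non-link path is reached, with a hop limit so a
    // cycle cannot loop forever.
    URI resolve(URI path) {
        URI current = path;
        for (int hops = 0; hops < 32; hops++) {
            URI next = symlinks.get(current);
            if (next == null) {
                return current;
            }
            current = next;
        }
        throw new IllegalStateException("too many levels of symbolic links");
    }

    // The server's check: looks only at the scheme of the path it was given.
    static boolean naiveSchemeCheck(URI submitted) {
        return "hdfs".equals(submitted.getScheme());
    }

    public static void main(String[] args) {
        SchemeCheckSketch fs = new SchemeCheckSketch();
        URI link = URI.create("hdfs://cluster1/user/evil/link");
        fs.addSymlink(link, URI.create("file:///etc/krb5.keytab"));

        System.out.println(naiveSchemeCheck(link));       // true: check passes
        System.out.println(fs.resolve(link).getScheme()); // file
    }
}
```

The submitted path passes the check, yet the data actually touched lives under the blocked scheme.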
*Jobs & Delegation Tokens*
Job submission gets tokens for the input/output paths. Let's say I submit with
paths to cluster1. I have no idea that my input might, or might someday,
contain symlinks to cluster2/3/4. Sans token, the task will fail when it tries
to follow a link to the other clusters. So what's the user to do? Hardcode
the job config to get tokens for _every namespace_ it _might_ access? Now the
single point of failure has been multiplied. HA doesn't solve that because
network rifts will cause token acquisition to fail for clusters the job didn't
really need.
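The token gap can be sketched by comparing the namespaces known at submit time with the namespaces a task ends up touching. This is a toy model, not the real job submission code: it treats the URI authority as the namespace, assumes every path has one, and all names are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: which namespaces does a task reach without a token?
final class TokenCoverageSketch {
    // Collects the namespace (URI authority) of each path. Assumes every
    // path carries an authority, as all the example paths here do.
    static Set<String> namespacesOf(List<URI> paths) {
        Set<String> out = new TreeSet<>();
        for (URI p : paths) {
            out.add(p.getAuthority());
        }
        return out;
    }

    // Namespaces accessed at runtime for which no token was fetched at
    // submit time, given that tokens are fetched only for submitted paths.
    static Set<String> missingTokens(List<URI> submitted, List<URI> accessed) {
        Set<String> missing = namespacesOf(accessed);
        missing.removeAll(namespacesOf(submitted));
        return missing;
    }

    public static void main(String[] args) {
        List<URI> submitted = Arrays.asList(
                URI.create("hdfs://cluster1/in"),
                URI.create("hdfs://cluster1/out"));
        // One input turns out to be a symlink into another cluster.
        List<URI> accessed = Arrays.asList(
                URI.create("hdfs://cluster1/in/part-0"),
                URI.create("hdfs://cluster2/data/part-1"));

        System.out.println(missingTokens(submitted, accessed)); // [cluster2]
    }
}
```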
–
In essence, cross-namespace symlinks will never be transparent to the user.
Even if a user carefully hardcodes his namenodes in the job conf (ugh), another
user or SE can screw them by creating a symlink to another namespace.
I think cross-namespace symlink support needs to be dropped. Counterpoints?
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by
> default
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-9984
> URL: https://issues.apache.org/jira/browse/HADOOP-9984
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Blocker
> Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch,
> HADOOP-9984.005.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that
> many existing HDFS clients would be broken by listStatus and globStatus
> returning symlinks. One example is applications that assume that
> !FileStatus#isFile implies that the inode is a directory. As we discussed in
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
> resolved paths.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira