[jira] [Commented] (HADOOP-9984) FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default

Daryn Sharp (JIRA) Fri, 27 Sep 2013 12:56:26 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780306#comment-13780306 ]


Daryn Sharp commented on HADOOP-9984:
-------------------------------------

I haven't had a chance to review the patch, but I came here to make another 
point which you serendipitously touched on.

bq. I added a new API, listLinkStatus, which is like listStatus, but does not 
resolve symlinks. listLinkStatus is necessary here, since _globStatus needs to 
glob on file name, not target name_

Both {{listStatus}} and {{listLinkStatus}} must return file stats with the 
exact same paths.  Those paths must be constructed from the "unresolved" paths. 
 +The only difference is whether the stat is for the symlink itself, or the 
stat of the resolved path.+
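
A minimal sketch of that contract, using a toy in-memory namespace rather than the real FileSystem API. The class, the Stat type, and the example paths are all hypothetical and are not taken from the patch; the boolean flag only models the listStatus / listLinkStatus distinction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ListingSketch {
    /** Hypothetical stand-in for a file status: listed path plus a symlink flag. */
    public static final class Stat {
        public final String path;
        public final boolean isSymlink;

        public Stat(String path, boolean isSymlink) {
            this.path = path;
            this.isSymlink = isSymlink;
        }
    }

    /** Toy namespace: entry path -> link target, or null for a regular file. */
    public static Map<String, String> namespace() {
        Map<String, String> ns = new LinkedHashMap<>();
        ns.put("/data/part-0", null);
        ns.put("/data/link", "/elsewhere/part-1");
        return ns;
    }

    /** resolveLinks=true models listStatus; false models listLinkStatus. */
    public static List<Stat> list(Map<String, String> ns, boolean resolveLinks) {
        List<Stat> out = new ArrayList<>();
        for (Map.Entry<String, String> e : ns.entrySet()) {
            boolean isLink = e.getValue() != null;
            // The path is always the unresolved entry name; only the stat differs.
            out.add(new Stat(e.getKey(), isLink && !resolveLinks));
        }
        return out;
    }

    /** Extracts just the paths so the two listings can be compared. */
    public static List<String> paths(List<Stat> stats) {
        List<String> out = new ArrayList<>();
        for (Stat s : stats) {
            out.add(s.path);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ns = namespace();
        // Both listings are expected to carry identical paths.
        System.out.println(paths(list(ns, true)).equals(paths(list(ns, false))));
    }
}
```

In this model the two listings differ only in the isSymlink flag of the link entry, never in the path.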

I also want to note that I've come to the conclusion that cross-namespace 
symlinks, at least for FileSystem, are a terrible and dangerous idea.

*Performance*
Returning unresolved paths is transparent and compatible for users, but every 
op will bang on every namespace it hops through.

*User and upper stack component code will break*
The frequent pattern of usage is obtaining the fs for a path, then assuming 
that all paths derived from list/glob stats can be used with the original fs.  
Upper layer stack components already break often enough when given multiple 
paths that aren't on the same fs, which has completely stalled federation here.  They 
assume all paths use the same fs as the first path.  Cross-namespace symlinks 
aggravate those bugs further.
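
To make the assumption concrete, here is a small sketch that compares scheme and authority using only java.net.URI. The host names nn1 and nn2 and the helper name sameFileSystem are hypothetical; this is not how any Hadoop component performs the check, only an illustration of what callers implicitly assume.

```java
import java.net.URI;

public class SameFsCheck {
    /** True when 'path' has the same scheme and authority as 'fsUri'. */
    public static boolean sameFileSystem(URI fsUri, URI path) {
        return equalsIgnoreCase(fsUri.getScheme(), path.getScheme())
            && equalsIgnoreCase(fsUri.getAuthority(), path.getAuthority());
    }

    /** Null-safe, case-insensitive comparison (authority is null for file:///). */
    private static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://nn1:8020/");
        // A path listed from the original namespace.
        System.out.println(sameFileSystem(fs, URI.create("hdfs://nn1:8020/user/a")));
        // A path that a resolved cross-namespace link could hand back.
        System.out.println(sameFileSystem(fs, URI.create("hdfs://nn2:8020/user/a")));
    }
}
```

Code that obtains one fs and reuses it for every listed path is silently relying on this check passing for every entry.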

*Silent Data Loss*
Let's say I generate a file listing and feed that into another app.  That app 
filters for paths that start with a particular prefix.  If the listing returns 
the resolved link path, which no longer carries the original prefix, the filter 
will miss that path and the input data is silently dropped.
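
A rough illustration of the prefix-filter case, with hypothetical paths and a hypothetical helper name. The two listings stand for the same directory, once reported with unresolved paths and once with a link replaced by its target in another namespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixFilterSketch {
    /** Keeps only the entries that start with the given prefix. */
    public static List<String> filterByPrefix(List<String> listing, String prefix) {
        List<String> kept = new ArrayList<>();
        for (String path : listing) {
            if (path.startsWith(prefix)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String prefix = "hdfs://nn1/input/";
        // Listing built from unresolved paths: the link keeps its own name.
        List<String> unresolved = Arrays.asList(
            "hdfs://nn1/input/part-0", "hdfs://nn1/input/part-1");
        // Same directory, but the link entry is reported as its target path.
        List<String> resolved = Arrays.asList(
            "hdfs://nn1/input/part-0", "hdfs://nn2/archive/part-1");

        System.out.println(filterByPrefix(unresolved, prefix).size());
        System.out.println(filterByPrefix(resolved, prefix).size());
    }
}
```

The filter raises no error in the second case; the entry simply disappears from the result.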

*Security Attacks*
It opens a whole new level of security issues involving symlink attacks.  Let's 
say a privileged server blocks access to certain schemes like "file" but 
happily accepts "hdfs" paths.  As a devious user, I now create a symlink in 
hdfs back to local filesystem.  I use this link to steal your keytab or maybe 
scribble over your config.
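
A sketch of why a scheme check on the requested path alone is not enough. The blocked-scheme set, the helper name, and both URIs are hypothetical; the point is only that the check sees the link's own path, not the target it resolves to.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

public class SchemeCheckSketch {
    /** Hypothetical server policy: local-filesystem paths are refused. */
    static final Set<String> BLOCKED_SCHEMES = Collections.singleton("file");

    /** Accepts a path only if it has a scheme and that scheme is not blocked. */
    public static boolean isAllowed(URI path) {
        String scheme = path.getScheme();
        return scheme != null
            && !BLOCKED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // What the user hands to the server: an hdfs path that is really a link.
        URI requested = URI.create("hdfs://nn1/user/someone/link");
        // What the link would resolve to if cross-namespace links are followed.
        URI target = URI.create("file:///etc/server.keytab");

        System.out.println(isAllowed(requested));
        System.out.println(isAllowed(target));
    }
}
```

The requested path passes the policy while the target would not, so the check would have to be repeated after every resolution step to be meaningful.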

*Jobs & Delegation Tokens*
Job submission gets tokens for the input/output paths.  Let's say I submit with 
paths to cluster1.  I have no idea that my input contains, or might someday 
contain, symlinks to cluster2/3/4.  Sans token, the task will fail when it tries 
to follow a link to the other clusters.  So what's the user to do?  Hardcode 
the job config to get tokens for _every namespace_ it _might_ access?  Now the 
single point of failure has been multiplied.  HA doesn't solve that because 
network rifts will cause token acquisition to fail for clusters the job didn't 
really need.
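
The token gap can be sketched as a set difference between the namespaces derived from the submitted paths and the namespaces a task ends up touching. The cluster names follow the example above; the class and method names are hypothetical and do not correspond to the actual job submission code.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TokenCoverageSketch {
    /** Collects "scheme://authority" for each path, one entry per namespace. */
    public static Set<String> namespacesOf(List<URI> paths) {
        Set<String> out = new TreeSet<>();
        for (URI p : paths) {
            out.add(p.getScheme() + "://" + p.getAuthority());
        }
        return out;
    }

    /** Namespaces that are accessed at run time but have no token. */
    public static Set<String> missing(Set<String> tokensFor, List<URI> accessed) {
        Set<String> needed = namespacesOf(accessed);
        needed.removeAll(tokensFor);
        return needed;
    }

    public static void main(String[] args) {
        // Tokens are fetched only for the namespaces of the submitted paths.
        Set<String> tokensFor = namespacesOf(Arrays.asList(
            URI.create("hdfs://cluster1/in"), URI.create("hdfs://cluster1/out")));
        // At run time one input turns out to be a link into cluster2.
        List<URI> accessed = Arrays.asList(
            URI.create("hdfs://cluster1/in/a"), URI.create("hdfs://cluster2/shared/b"));

        System.out.println(missing(tokensFor, accessed));
    }
}
```

Nothing at submission time reveals the second namespace, so the gap only shows up when the task fails.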

–

In essence, cross-namespace symlinks will never be transparent to the user.  
Even if a user carefully hardcodes his namenodes in the job conf (ugh), another 
user or SE can screw them by creating a symlink to another namespace.

I think cross-namespace symlink support needs to be dropped.  Counterpoints?
                
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by 
> default
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Blocker
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, 
> HADOOP-9984.005.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that 
> many existing HDFS clients would be broken by listStatus and globStatus 
> returning symlinks.  One example is applications that assume that 
> !FileStatus#isFile implies that the inode is a directory.  As we discussed in 
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning 
> resolved paths.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
