[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760241#comment-13760241
 ] 

Jason Lowe commented on HADOOP-9912:
------------------------------------

Thanks for the behavior matrix, Colin.  I think the issue of 
compatible/incompatible is about *expectations* of the FileSystem listStatus 
API.  FileSystem hasn't supported symlinks until very recently, and as a result 
I doubt many, if any, symlinks were being used in HDFS.  Manipulating them 
required custom Java code, and nothing written against FileSystem would work 
with them.

I am under the impression that we want symlinks to "just work" for the majority 
of existing applications.  If that's the case then we need to avoid exposing 
raw symlinks as results from the existing FileSystem APIs as callers aren't 
expecting to deal with them.  A directory walker is the classic case: it 
expects isDir() to tell it when to traverse subdirectories, and a symlink to a 
directory breaks that assumption.
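
To make the walker problem concrete, here is a rough local-filesystem analog 
written against java.nio.file rather than Hadoop's FileSystem.  The class and 
method names are hypothetical, and the followLinks flag only stands in for 
"the listing resolved the symlink" versus "the listing returned the raw 
symlink".

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Hypothetical sketch: a naive walker that trusts an "is directory" answer.
final class WalkerSketch {
    /**
     * Counts regular files under root, descending only into entries that
     * report themselves as directories.  Assumes no symlink cycles.
     */
    static int countFiles(Path root, boolean followLinks) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                // With NOFOLLOW_LINKS a symlink to a directory is not a
                // directory, which mimics a listing that exposes raw symlinks.
                boolean isDir = followLinks
                        ? Files.isDirectory(entry)
                        : Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS);
                if (isDir) {
                    count += countFiles(entry, followLinks);
                } else if (Files.isRegularFile(entry)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

If root contains only a symlink to a directory holding two files, 
countFiles(root, true) finds both, while countFiles(root, false) finds none 
and gives no error: the walker silently skips the linked subtree.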

A proposal to keep the existing FileSystem users working with symlinks in HDFS:

- listStatus resolves symlinks when possible.  If a symlink cannot be 
resolved (e.g. it is dangling or its target path is permission-restricted), 
listStatus returns the status of the symlink itself, since it cannot stat the 
symlink target.
- A separate API, either an overload of listStatus with an extra flag to 
control symlink resolution or a separate listLinkStatus, can be used by 
callers that always want the symlink status and not the status of the symlink 
target.  I would not expect the majority of existing listStatus callers to want 
to see symlinks and have to resolve them.  This is akin to the 
getFileStatus/getFileLinkStatus pairing: existing callers of getFileStatus 
never expected symlinks, so it always follows them, and a new API was added to 
examine the symlink itself, rather than a new status API that always follows 
the symlink.
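
The two bullets above could look roughly like the following.  This is only a 
sketch of the proposed semantics on the local filesystem using java.nio.file; 
it is not Hadoop code, and ListStatusSketch, Entry and the boolean overload 
are names made up for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "resolve when possible, else return the link".
final class ListStatusSketch {
    /** Stand-in for a status entry; not Hadoop's FileStatus. */
    static final class Entry {
        final Path path;
        final BasicFileAttributes attrs;

        Entry(Path path, BasicFileAttributes attrs) {
            this.path = path;
            this.attrs = attrs;
        }
    }

    /** Default behavior for existing callers: resolve symlinks. */
    static List<Entry> listStatus(Path dir) throws IOException {
        return listStatus(dir, true);
    }

    /** Overload with an extra flag controlling symlink resolution. */
    static List<Entry> listStatus(Path dir, boolean resolveLinks) throws IOException {
        List<Entry> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                result.add(new Entry(child, stat(child, resolveLinks)));
            }
        }
        return result;
    }

    private static BasicFileAttributes stat(Path p, boolean resolveLinks) throws IOException {
        if (resolveLinks) {
            try {
                // Follows symlinks, like stat.
                return Files.readAttributes(p, BasicFileAttributes.class);
            } catch (IOException unresolved) {
                // Dangling or inaccessible target: fall through to the link itself.
            }
        }
        // Does not follow symlinks, like lstat.
        return Files.readAttributes(p, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
    }
}
```

With this shape, a caller that only checks attrs.isDirectory() keeps working 
for symlinks to directories, and attrs.isSymbolicLink() is true only for links 
that could not be resolved or when the caller passes resolveLinks = false.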

For me it's all about what callers are expecting FileSystem's listStatus 
semantics to be.  I believe that existing callers are *not* expecting symlinks 
to be returned since FileSystem never supported them in the past and I doubt 
they were being used in HDFS in general.  Most callers are expecting listStatus 
to be a readdir and stat, and stat follows symlinks.  If listStatus does not 
resolve symlinks then it breaks existing Pig and MapReduce code, and I believe 
that's an indication it will break a lot more code out there.  The code that 
breaks can be updated to understand symlinks, but I believe in practice that 
means symlinks to directories will be fragile for a long time.  Each tool that 
encounters them will have to be updated to check for them and behave 
accordingly.
                
> globStatus of a symlink to a directory does not report symlink as a directory
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-9912
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9912
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt, 
> old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.
