[
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799562#comment-13799562
]
Daryn Sharp commented on HADOOP-9984:
-------------------------------------
bq. listStatus should NOT follow child symlinks. Fix all internal utilities,
hive, pig, map reduce, yarn, etc to not use isDir() and understand that a
directory may contain symlinks.
I do not agree. This would mean symlinks are not transparent and not compatible
with pre-2.x behavior. I also do not agree that any solution will, or has to,
break existing apps.
Furthermore, the user will rarely, if ever, care that something is a symlink. So
every user who gets a file status through any of the existing API methods should
_not_ be burdened with checking whether it's a symlink and then resolving it
before checking various criteria. This is about more than just isDir(). What
if I'm checking file size? Or owner/group/permissions? I expect the results
to be of the target, not the link.
I think the only sensible solution to ensure compatibility is:
# A new filtered fs wrapper whose sole responsibility is resolving symlinks.
FileSystem.get can automatically add the wrapper. If the user really wants to
see symlinks, they can call getRawFs.
# No other filesystem does symlink resolution of any kind. I've outlined in
other jiras how having individual filesystems resolve symlinks is fundamentally
broken, e.g. with viewfs.
# The new symlink-aware fs wrapper will return file statuses for symlinks that
lazily resolve the file status, a la RLFS. Lazy resolution keeps unresolvable
symlinks, which the user wasn't going to select based on name anyway, from
causing exceptions.
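
To make the three points above concrete, here is a rough sketch of how such a wrapper could behave. It deliberately does not use the real Hadoop classes: SimpleStatus, RawFs, LazyResolvedStatus and SymlinkResolvingFs are hypothetical names invented for this illustration, and having the wrapped status report isSymlink() as false is an assumption about what "transparent" would mean, not something taken from any patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical, simplified file status; NOT Hadoop's FileStatus. */
class SimpleStatus {
    final String path;
    final long length;
    final boolean directory;
    final String symlinkTarget; // null unless this entry is a symlink

    SimpleStatus(String path, long length, boolean directory, String symlinkTarget) {
        this.path = path;
        this.length = length;
        this.directory = directory;
        this.symlinkTarget = symlinkTarget;
    }

    long getLen() { return length; }
    boolean isDirectory() { return directory; }
    boolean isSymlink() { return symlinkTarget != null; }
}

/** Hypothetical raw filesystem: returns symlinks as-is, resolves nothing. */
class RawFs {
    private final Map<String, SimpleStatus> entries = new LinkedHashMap<>();

    void add(SimpleStatus s) { entries.put(s.path, s); }

    SimpleStatus getStatus(String path) {
        SimpleStatus s = entries.get(path);
        if (s == null) throw new IllegalStateException("No such path: " + path);
        return s;
    }

    List<SimpleStatus> list() { return new ArrayList<>(entries.values()); }
}

/** Keeps the link's own path but reports the target's attributes, resolved on first use. */
class LazyResolvedStatus extends SimpleStatus {
    private final Supplier<SimpleStatus> resolver;
    private SimpleStatus target; // cached after the first successful resolution

    LazyResolvedStatus(SimpleStatus link, Supplier<SimpleStatus> resolver) {
        super(link.path, link.length, link.directory, link.symlinkTarget);
        this.resolver = resolver;
    }

    private SimpleStatus target() {
        if (target == null) target = resolver.get(); // throws only here for a dangling link
        return target;
    }

    @Override long getLen() { return target().getLen(); }
    @Override boolean isDirectory() { return target().isDirectory(); }
    @Override boolean isSymlink() { return false; } // assumption: the link is hidden from callers
}

/** Hypothetical wrapper whose sole responsibility is resolving symlinks. */
class SymlinkResolvingFs {
    private final RawFs raw;

    SymlinkResolvingFs(RawFs raw) { this.raw = raw; }

    /** Escape hatch for callers that really want to see the links. */
    RawFs getRawFs() { return raw; }

    /** Listing never resolves eagerly, so a dangling link cannot fail the whole call. */
    List<SimpleStatus> listStatus() {
        List<SimpleStatus> out = new ArrayList<>();
        for (SimpleStatus s : raw.list()) {
            out.add(s.isSymlink()
                ? new LazyResolvedStatus(s, () -> resolve(s.symlinkTarget))
                : s);
        }
        return out;
    }

    private SimpleStatus resolve(String path) {
        SimpleStatus s = raw.getStatus(path);
        while (s.isSymlink()) s = raw.getStatus(s.symlinkTarget); // no loop detection in this sketch
        return s;
    }
}
```

In this sketch, listing a directory that contains a dangling link succeeds, and the exception surfaces only if the caller actually asks that entry for its length or type. A caller that selects entries by name never triggers resolution at all, which is the property the third point is after.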
Let's make hadoop work like every other filesystem by making symlinks be
transparent unless the user explicitly wants to know if something is a symlink.
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by
> default
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-9984
> URL: https://issues.apache.org/jira/browse/HADOOP-9984
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Blocker
> Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch,
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch,
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch,
> HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that
> many existing HDFS clients would be broken by listStatus and globStatus
> returning symlinks. One example is applications that assume that
> !FileStatus#isFile implies that the inode is a directory. As we discussed in
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
> resolved paths.