[
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786431#comment-13786431
]
Daryn Sharp commented on HADOOP-9984:
-------------------------------------
bq. We agreed that doing the symlink resolution in each Filesystem subclass is
what we ought to do in 9984, in order to keep compatibility with out-of-tree
filesystems.
Yesterday, I was only able to briefly touch on the issue highlighted in
HADOOP-10014: stacked filesystems like viewfs cannot work correctly if path
resolution is performed on a per-filesystem basis.
I think we need to reconsider how we are going to alter the APIs. Perhaps the
existing APIs return the symlinks, as today. It's the responsibility of a
symlink-aware fs wrapper to manage the symlink resolution so it's done in a
top-down manner.
Let's take an example:
viewfs://table/mount -> hdfs://host:port/some/path
hdfs://host:port/some/path/link is a symlink to "/mount2"
Does viewfs://table/mount/link refer to viewfs://table/mount2 or to
hdfs://host:port/mount2? Essentially, is it logical for a symlink on a
low-level fs to resolve relative to itself, rather than be resolved by the
top-level fs? If you think it should resolve relative to itself, then it means
a chroot fs is meaningless because I can unexpectedly walk out of the chroot.
I'm not even sure how viewfs/chroot will react to a resolved path that went out of
the mount/chroot.
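A minimal sketch of the two readings, using plain strings rather than the real viewfs code (the class and method names are hypothetical, made up for illustration):

```java
// Hypothetical model of the example above; plain strings, not Hadoop's viewfs API.
final class MountResolutionSketch {
    // Qualify an absolute symlink target against the root of whichever
    // filesystem is doing the resolution.
    static String qualify(String fsRoot, String absoluteTarget) {
        return fsRoot + absoluteTarget;
    }

    // True if the resolved path is no longer under the mount/chroot target.
    static boolean escapes(String resolved, String mountTarget) {
        return !(resolved.equals(mountTarget) || resolved.startsWith(mountTarget + "/"));
    }

    public static void main(String[] args) {
        String mountTarget = "hdfs://host:port/some/path";
        String linkTarget = "/mount2";

        // Per-filesystem: hdfs resolves the link relative to itself.
        String bottomUp = qualify("hdfs://host:port", linkTarget);
        // Top-down: the top-level viewfs resolves the link.
        String topDown = qualify("viewfs://table", linkTarget);

        System.out.println(bottomUp + " escapes mount: " + escapes(bottomUp, mountTarget));
        System.out.println(topDown);
    }
}
```

In this model the per-filesystem answer lands outside {{hdfs://host:port/some/path}}, which is the chroot escape described above.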
bq. We discussed the issue of returning resolved paths versus unresolved paths,
but were unable to come to any conclusion. Everyone agreed that there would be
serious performance problems if we returned unresolved paths, but some claimed
that programs would break when encountering resolved paths.
I think it's fair to s/some claimed/most claimed/. Let's explore some examples
of why returning qualified paths is incorrect:
+PathFilters+
I have {{/proj}} containing {{daryn-2012}} and {{daryn-2013}}. I use
{{listStatus(Path, PathFilter)}}. PathFilters operate on the full path. I
wrote my filter to expect {{/proj/daryn-2012}} and {{/proj/daryn-2013}} so I'm
filtering for {{/proj/daryn-}}. It works fine today.
Now let's say {{/proj/daryn-2012}} is a symlink to
{{/archive/daryn-2012}}. With a resolved path, my prefix-matching path filter
will silently skip the file. If you think my path filter should have just
filtered on the basename, what if {{/proj/daryn-2012}} is a symlink to
{{/archive/daryn/2012}}?
It's impossible for my path filter to work correctly with resolved paths.
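To make that concrete, here is a rough sketch that stands in a {{Predicate<String>}} for the PathFilter and a map for the symlink. None of this is the Hadoop API; the names are invented for the example:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for listStatus(Path, PathFilter); not Hadoop code.
final class PathFilterSketch {
    // The symlink from the second variant above.
    static final Map<String, String> SYMLINKS =
        Map.of("/proj/daryn-2012", "/archive/daryn/2012");

    // Filter on the full path, as my existing filter does.
    static final Predicate<String> FULL_PATH_PREFIX = p -> p.startsWith("/proj/daryn-");

    // Filter on the basename only, the suggested workaround.
    static final Predicate<String> BASENAME_PREFIX =
        p -> p.substring(p.lastIndexOf('/') + 1).startsWith("daryn-");

    // List entries, optionally replacing symlinks by their targets, then filter.
    static List<String> list(List<String> entries, boolean resolve, Predicate<String> filter) {
        return entries.stream()
            .map(p -> resolve ? SYMLINKS.getOrDefault(p, p) : p)
            .filter(filter)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> proj = List.of("/proj/daryn-2012", "/proj/daryn-2013");
        System.out.println(list(proj, false, FULL_PATH_PREFIX)); // both entries
        System.out.println(list(proj, true, FULL_PATH_PREFIX));  // 2012 entry dropped
        System.out.println(list(proj, true, BASENAME_PREFIX));   // 2012 entry still dropped
    }
}
```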
+General Filtering+
I have an app that generates a file list. I feed it to something else that
filters based on some path criteria. The resolved paths may no longer match
the pattern and be silently dropped.
On the flip side, maybe the unresolved paths didn't and weren't expected to
match a name pattern. But the target does. I just silently picked up
unexpected data!
+Duplicates+
I have a directory with symlinks {{/dir/link1}} and {{/dir/link2}} that both point
to {{/dir2/file}}. I do {{listStatus("/dir")}} - should I get dup paths in the
result? How will applications react to that?
What if I intended to rename each $path to $path.old? I expected
{{/dir/link1.old}} and {{/dir/link2.old}}. With resolved paths, I directly
renamed {{/dir2/file}} to {{/dir2/file.old}}. Then because of the duplicate
path from list status, it tries again and fails because it's already been
renamed. Add on top that my symlinks are now broken/dangling.
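A small simulation of that rename loop, assuming a namespace modeled as a set of path strings (hypothetical names throughout, not the FileSystem API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: the namespace is just a set of existing paths.
final class DuplicateRenameSketch {
    // Rename every listed path to path + ".old", recording each outcome.
    static List<String> renameAll(List<String> listing, Set<String> namespace) {
        List<String> log = new ArrayList<>();
        for (String p : listing) {
            if (namespace.remove(p)) {
                namespace.add(p + ".old");
                log.add("renamed " + p);
            } else {
                log.add("failed " + p); // source no longer exists
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // /dir/link1 and /dir/link2 both point to /dir2/file.
        Set<String> ns = new TreeSet<>(List.of("/dir/link1", "/dir/link2", "/dir2/file"));

        // A resolved listing of /dir returns the target twice.
        System.out.println(renameAll(List.of("/dir2/file", "/dir2/file"), ns));
        // The links keep their names, and their target is gone.
        System.out.println(ns);
    }
}
```

Feeding the unresolved listing ({{/dir/link1}}, {{/dir/link2}}) through the same loop renames the two links and leaves {{/dir2/file}} alone, which is what I expected in the first place.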
+Copying+
I have {{/dir1/link}} pointing to {{/dir2/real}}. I do a
{{listStatus("/dir1")}} and intend to copy the files into {{/dir3}}. Instead
of {{/dir3/link}}, I get {{/dir3/real}}.
+Delete+
Again, I have a directory with symlinks pointing elsewhere. I do a listStatus, walk
the entries intending to delete paths that match some criteria. While one
would naturally expect the symlink to be deleted, the resolved path made me
delete the target!! That's destruction of data!
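Sketched with the same kind of toy model (a set of paths, an invented {{/elsewhere/data}} target, and a criterion that happens to match everything; all of it hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical model: delete whatever the listing returned, if it matches.
final class DeleteSketch {
    // Returns the paths that remain after deleting the matching listed entries.
    static Set<String> deleteListed(Set<String> namespace, List<String> listing,
                                    Predicate<String> criteria) {
        Set<String> remaining = new TreeSet<>(namespace);
        for (String p : listing) {
            if (criteria.test(p)) {
                remaining.remove(p);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // /dir/link is a symlink to /elsewhere/data (an assumed example target).
        Set<String> ns = Set.of("/dir/link", "/elsewhere/data");
        Predicate<String> matchAll = p -> true;

        // Unresolved listing: only the link is removed.
        System.out.println(deleteListed(ns, List.of("/dir/link"), matchAll));
        // Resolved listing: the data is removed and the link dangles.
        System.out.println(deleteListed(ns, List.of("/elsewhere/data"), matchAll));
    }
}
```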
---
In short, despite the performance impact, the only correct thing to do is
return the unresolved path. Symlinks must be completely invisible to pre-2.x
applications or all sorts of unexpected behavior will occur that may drop data,
destroy data, etc. Debugging these odd cases in production will be nearly
impossible, if they can even be detected in the first place.
As much as I want symlinks, the full impact of the implementation was not fully
considered. The current implementation is dangerous in many ways.
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by
> default
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-9984
> URL: https://issues.apache.org/jira/browse/HADOOP-9984
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Blocker
> Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch,
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch,
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch,
> HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that
> many existing HDFS clients would be broken by listStatus and globStatus
> returning symlinks. One example is applications that assume that
> !FileStatus#isFile implies that the inode is a directory. As we discussed in
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
> resolved paths.