[ 
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786431#comment-13786431
 ] 

Daryn Sharp commented on HADOOP-9984:
-------------------------------------

bq.  We agreed that doing the symlink resolution in each Filesystem subclass is 
what we ought to do in 9984, in order to keep compatibility with out-of-tree 
filesystems.

Yesterday, I was only able to briefly touch on the issue highlighted in 
HADOOP-10014: stacked filesystems like viewfs cannot work correctly if path 
resolution is performed on a per-filesystem basis.

I think we need to reconsider how we are going to alter the APIs.  Perhaps the 
existing APIs should keep returning symlinks, as they do today, and it becomes 
the responsibility of a symlink-aware fs wrapper to manage the symlink 
resolution so it's done in a top-down manner.

Let's take an example:

viewfs://table/mount -> hdfs://host:port/some/path
hdfs://host:port/some/path/link is a symlink to "/mount2"

Does viewfs://table/mount/link refer to viewfs://table/mount2 or to 
hdfs://host:port/mount2?  Essentially, is it logical for a symlink on a 
low-level fs to resolve relative to itself, rather than be resolved by the 
top-level fs?  If you think it should resolve relative to itself, then it means 
a chroot fs is meaningless because I can unexpectedly walk out of the chroot.
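
A rough sketch of the two possible answers, in Python on plain strings rather 
than the real FileSystem API.  Every name below ({{to_underlying}}, 
{{resolve_per_filesystem}}, {{resolve_top_down}}) is made up for illustration, 
and I'm assuming the link is reached through the mount as 
viewfs://table/mount/link.

```python
# Toy model of the mount table and symlink from the example above.
# All names are illustrative; none of this is Hadoop API.
MOUNT_POINT = "viewfs://table/mount"
MOUNT_TARGET = "hdfs://host:port/some/path"
HDFS_ROOT = "hdfs://host:port"
VIEWFS_ROOT = "viewfs://table"

# The symlink as stored on the underlying fs: an absolute target, no scheme.
SYMLINKS = {"hdfs://host:port/some/path/link": "/mount2"}


def to_underlying(view_path):
    """Translate a viewfs path under the mount point to its hdfs path."""
    assert view_path.startswith(MOUNT_POINT)
    return MOUNT_TARGET + view_path[len(MOUNT_POINT):]


def resolve_per_filesystem(view_path):
    """The low-level fs resolves the link relative to itself."""
    return HDFS_ROOT + SYMLINKS[to_underlying(view_path)]


def resolve_top_down(view_path):
    """A symlink-aware wrapper resolves the link against the top-level fs."""
    return VIEWFS_ROOT + SYMLINKS[to_underlying(view_path)]


link = "viewfs://table/mount/link"
print(resolve_per_filesystem(link))  # hdfs://host:port/mount2
print(resolve_top_down(link))        # viewfs://table/mount2
```

The per-filesystem answer lands on an hdfs path that the mount table never 
exposed, which is exactly the walk out of the chroot.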

I'm not even sure how viewfs/chroot will react to a resolved path that went out 
of the mount/chroot.

bq. We discussed the issue of returning resolved paths versus unresolved paths, 
but were unable to come to any conclusion. Everyone agreed that there would be 
serious performance problems if we returned unresolved paths, but some claimed 
that programs would break when encountering resolved paths.

I think it's fair to s/some claimed/most claimed/.  Let's explore some examples 
of why returning resolved paths is incorrect:

+PathFilters+
I have {{/proj}} containing {{daryn-2012}} and {{daryn-2013}}.  I use 
{{listStatus(Path, PathFilter)}}. PathFilters operate on the full path.  I 
wrote my filter to expect {{/proj/daryn-2012}} and {{/proj/daryn-2013}} so I'm 
filtering for {{/proj/daryn-}}.  It works fine today.

Now let's say {{/proj/daryn-2012}} is a symlink to 
{{/archive/daryn-2012}}.  With a resolved path, my prefix-matching path filter 
will silently skip the file.  If you think my path filter should have just 
filtered on the basename, what if {{/proj/daryn-2012}} is a symlink to 
{{/archive/daryn/2012}}?

It's impossible for my path filter to work correctly with resolved paths.
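
To spell that out, here is a small Python model in which listings are plain 
path strings.  {{prefix_filter}} and {{basename_filter}} are hypothetical 
stand-ins for a PathFilter, not Hadoop code.

```python
# Toy model: a listing is a list of path strings; not the Hadoop API.
def prefix_filter(path):
    """Stand-in for a PathFilter that accepts /proj/daryn-* entries."""
    return path.startswith("/proj/daryn-")


def basename_filter(path):
    """The 'just filter on the basename' alternative."""
    return path.rsplit("/", 1)[-1].startswith("daryn-")


# What listStatus("/proj") yields today.
unresolved = ["/proj/daryn-2012", "/proj/daryn-2013"]
# Resolved listing when /proj/daryn-2012 -> /archive/daryn-2012.
resolved_flat = ["/archive/daryn-2012", "/proj/daryn-2013"]
# Resolved listing when /proj/daryn-2012 -> /archive/daryn/2012.
resolved_nested = ["/archive/daryn/2012", "/proj/daryn-2013"]

print([p for p in unresolved if prefix_filter(p)])
# ['/proj/daryn-2012', '/proj/daryn-2013']
print([p for p in resolved_flat if prefix_filter(p)])
# ['/proj/daryn-2013']
print([p for p in resolved_nested if basename_filter(p)])
# ['/proj/daryn-2013']
```

The basename filter rescues the first symlink target but not the second, and 
neither failure raises an error.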

+General Filtering+
I have an app that generates a file list.  I feed it to something else that 
filters based on some path criteria.  The resolved paths may no longer match 
the pattern and be silently dropped.

On the flip side, maybe the unresolved paths didn't and weren't expected to 
match a name pattern.  But the target does.  I just silently picked up 
unexpected data!

+Duplicates+
I have a directory with symlinks {{/dir/link1}} and {{/dir/link2}} that both 
point to {{/dir2/file}}.  I do {{listStatus("/dir")}} - should I get duplicate 
paths in the result?  How will applications react to that?

What if I intended to rename each $path to $path.old?  I expected 
{{/dir/link1.old}} and {{/dir/link2.old}}.  With resolved paths, I directly 
renamed {{/dir2/file}} to {{/dir2/file.old}}.  Then because of the duplicate 
path from list status, it tries again and fails because it's already been 
renamed.  Add on top that my symlinks are now broken/dangling.
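
The rename case can be played through on a toy namespace.  This is a sketch in 
Python, assuming a dict that maps each path to its symlink target (or None for 
a regular file); {{list_status}}, {{rename}} and {{rename_all_to_old}} are 
invented helpers, not the FileSystem methods.

```python
# Toy namespace: path -> symlink target, or None for a regular file.
# Illustrative only; not the Hadoop API.
def make_namespace():
    return {
        "/dir/link1": "/dir2/file",
        "/dir/link2": "/dir2/file",
        "/dir2/file": None,
    }


def list_status(ns, directory, resolve):
    """List entries directly under `directory`, optionally as resolved targets."""
    entries = sorted(p for p in ns if p.rsplit("/", 1)[0] == directory)
    return [ns[p] if resolve and ns[p] is not None else p for p in entries]


def rename(ns, src, dst):
    """Rename one entry; fail if the source no longer exists."""
    if src not in ns:
        raise FileNotFoundError(src)
    ns[dst] = ns.pop(src)


def rename_all_to_old(ns, paths):
    """Rename every listed path to path.old and return the paths that failed."""
    failures = []
    for p in paths:
        try:
            rename(ns, p, p + ".old")
        except FileNotFoundError:
            failures.append(p)
    return failures


ns = make_namespace()
listing = list_status(ns, "/dir", resolve=True)
print(listing)                         # ['/dir2/file', '/dir2/file']
print(rename_all_to_old(ns, listing))  # ['/dir2/file']
print(sorted(ns))                      # ['/dir/link1', '/dir/link2', '/dir2/file.old']
```

Both links survive under their old names but now point at a path that no 
longer exists.  With {{resolve=False}} the same loop produces 
{{/dir/link1.old}} and {{/dir/link2.old}} and reports no failures.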

+Copying+
I have {{/dir1/link}} pointing to {{/dir2/real}}.  I do a 
{{listStatus("/dir1")}} and intend to copy the files into {{/dir3}}.  Instead 
of {{/dir3/link}}, I get {{/dir3/real}}.
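
As a sketch, assume a copy tool that names each destination after the basename 
of the listed path ({{copy_destinations}} is a hypothetical helper, in Python 
on plain strings):

```python
# Toy model of a copy tool: destination name = basename of the listed path.
# Illustrative only; not the Hadoop API.
def copy_destinations(listing, dest_dir):
    return [dest_dir + "/" + p.rsplit("/", 1)[-1] for p in listing]


print(copy_destinations(["/dir1/link"], "/dir3"))  # unresolved: ['/dir3/link']
print(copy_destinations(["/dir2/real"], "/dir3"))  # resolved:   ['/dir3/real']
```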

+Delete+
Again, I have a directory with a symlink that points elsewhere.  I do a 
listStatus and walk the entries, intending to delete paths that match some 
criteria.  While one 
would naturally expect the symlink to be deleted, the resolved path made me 
delete the target!!  That's destruction of data!
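
A minimal Python sketch of that walk, again on a toy dict namespace (path 
mapped to its symlink target, or None for a regular file).  The paths and the 
{{delete_matching}} helper are made up for illustration.

```python
# Toy namespace: path -> symlink target, or None for a regular file.
# Illustrative only; not the Hadoop API.
def make_namespace():
    return {"/dir/link": "/data/important", "/data/important": None}


def delete_matching(ns, listing, predicate):
    """Delete every listed path that satisfies predicate; return what was deleted."""
    deleted = []
    for p in listing:
        if predicate(p) and p in ns:
            del ns[p]
            deleted.append(p)
    return deleted


def match_all(path):
    """Placeholder criterion: accept every listed path."""
    return True


# Unresolved listing of /dir: the link itself is removed.
ns = make_namespace()
delete_matching(ns, ["/dir/link"], match_all)
print(sorted(ns))  # ['/data/important']

# Resolved listing of /dir: the target is removed and the link dangles.
ns = make_namespace()
delete_matching(ns, ["/data/important"], match_all)
print(sorted(ns))  # ['/dir/link']
```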

---

In short, despite the performance impact, the only correct thing to do is 
return the unresolved path.  Symlinks must be completely invisible to pre-2.x 
applications or all sorts of unexpected behavior will occur that may drop data, 
destroy data, etc.  Debugging these odd cases in production will be nearly 
impossible, if they can even be detected in the first place.

As much as I want symlinks, the full impact of the implementation was not fully 
considered.  The current implementation is dangerous in many ways.

> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by 
> default
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Blocker
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, 
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch, 
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch, 
> HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that 
> many existing HDFS clients would be broken by listStatus and globStatus 
> returning symlinks.  One example is applications that assume that 
> !FileStatus#isFile implies that the inode is a directory.  As we discussed in 
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning 
> resolved paths.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
