[
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783264#comment-13783264
]
Chris Nauroth commented on HADOOP-9984:
---------------------------------------
The code in the latest patch is looking good. I'm planning to give it a full
test run on Windows overnight in case there are any sneaky OS-specific issues.
A few comments/questions:
1. Failure to auto-resolve any symlink causes an exception for the whole
   operation. There had been prior discussion of supporting an option to
   ignore symlink resolution failures. Is that out of scope right now and
   coming later in HADOOP-9972?
2. Nice job updating JavaDocs to describe the effects of symlinks on existing
   methods. I'm going to take one more pass over this part, just to make sure
   we covered everything.
3. Regarding the backwards-incompatible change of abstract listStatus to
   abstract listLinkStatus, I also do not see a way to avoid this. At least
   this way, it's only incompatible for subclass implementers and not callers.
4. Methods that perform auto-resolution will return multiple occurrences of
   the same path if there are multiple symlinks with the same target. I
   haven't seen this mentioned explicitly in the prior threads discussing
   compatibility concerns, so I thought I'd bring it up. This decision can be
   significant for apps. Taking the example of MapReduce running against HDFS,
   {{FileInputFormat#getSplits}} runs {{FileSystem#globStatus}} and skips
   symlinks (based on a length != 0 check). If {{FileSystem#globStatus}}
   returns symlinks, they don't go into the job input. If the symlinks are
   auto-resolved (as in this patch), then the same HDFS blocks get used
   multiple times to create multiple input splits. According to comments in
   HADOOP-9912, {{globStatus}} has been inconsistent over time, and I think
   auto-resolving yields the correct expected behavior anyway. I have no
   objection to the change, but I wanted to describe it clearly in case
   anyone else has concerns.
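
To make the duplicate-input concern concrete, here is a toy model in plain
Java. It is only a sketch under stated assumptions: Entry, sampleDir and
countSplitInputs are hypothetical names, the listLinkStatus/listStatus methods
only borrow the names used in this issue, and none of this is the Hadoop API or
the code in the patch. It models a directory with one real file and two
symlinks to it, then compares a raw listing against an auto-resolved one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Hadoop classes.
class SymlinkListingSketch {

    /** Simplified status entry; a symlink has length 0 and a non-null target. */
    static final class Entry {
        final String path;
        final long length;
        final String symlinkTarget; // null for a regular file

        Entry(String path, long length, String symlinkTarget) {
            this.path = path;
            this.length = length;
            this.symlinkTarget = symlinkTarget;
        }

        boolean isSymlink() {
            return symlinkTarget != null;
        }
    }

    /** Raw listing: symlinks come back as link entries, unresolved. */
    static List<Entry> listLinkStatus(Map<String, Entry> dir) {
        return new ArrayList<>(dir.values());
    }

    /**
     * Resolving listing: each symlink is replaced by its target's entry
     * (one level only in this sketch). Any link that cannot be resolved
     * fails the whole call.
     */
    static List<Entry> listStatus(Map<String, Entry> dir) {
        List<Entry> resolved = new ArrayList<>();
        for (Entry e : dir.values()) {
            if (!e.isSymlink()) {
                resolved.add(e);
                continue;
            }
            Entry target = dir.get(e.symlinkTarget);
            if (target == null) {
                throw new IllegalStateException("cannot resolve symlink " + e.path);
            }
            resolved.add(target);
        }
        return resolved;
    }

    /** Mimics a length != 0 filter: counts entries that would become job input. */
    static int countSplitInputs(List<Entry> listing) {
        int count = 0;
        for (Entry e : listing) {
            if (e.length != 0) {
                count++;
            }
        }
        return count;
    }

    /** One 128-byte file plus two symlinks that both point at it. */
    static Map<String, Entry> sampleDir() {
        Map<String, Entry> dir = new LinkedHashMap<>();
        dir.put("/data/part-0", new Entry("/data/part-0", 128L, null));
        dir.put("/data/link-a", new Entry("/data/link-a", 0L, "/data/part-0"));
        dir.put("/data/link-b", new Entry("/data/link-b", 0L, "/data/part-0"));
        return dir;
    }

    public static void main(String[] args) {
        Map<String, Entry> dir = sampleDir();
        System.out.println("raw listing inputs:      "
                + countSplitInputs(listLinkStatus(dir)));
        System.out.println("resolved listing inputs: "
                + countSplitInputs(listStatus(dir)));
    }
}
```

With the sample directory, the raw listing yields one input (the two
zero-length links are skipped), while the resolved listing yields three, all
for /data/part-0. An app that wants the old behavior could de-duplicate on the
resolved path, at the cost of silently dropping links the user may have listed
on purpose.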
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by
> default
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-9984
> URL: https://issues.apache.org/jira/browse/HADOOP-9984
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Blocker
> Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch,
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch,
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that
> many existing HDFS clients would be broken by listStatus and globStatus
> returning symlinks. One example is applications that assume that
> !FileStatus#isFile implies that the inode is a directory. As we discussed in
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
> resolved paths.