[ 
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783264#comment-13783264
 ] 

Chris Nauroth commented on HADOOP-9984:
---------------------------------------

The code in the latest patch is looking good.  I'm planning to give it a full 
test run on Windows overnight in case there are any sneaky OS-specific issues.  
A few comments/questions:

Failure to auto-resolve any symlink causes an exception for the whole 
operation.  There had been prior discussion of supporting an option to ignore 
symlink resolution failures.  Is that out of scope right now and coming later 
in HADOOP-9972?
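
To make the question concrete, here is a rough sketch of the two policies. It 
is a toy model, not code from the patch: {{ResolvePolicySketch}}, its methods, 
and the {{ignoreFailures}} flag are all hypothetical names I made up for 
illustration, and paths are plain strings.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hadoop code: a listing is a list of entry names,
// and `links` maps each symlink name to its target.
final class ResolvePolicySketch {

    /** Resolves one entry; a symlink whose target does not exist fails. */
    static String resolve(String entry, Map<String, String> links,
                          List<String> existing) throws IOException {
        String target = links.getOrDefault(entry, entry);
        if (!existing.contains(target)) {
            throw new IOException("cannot resolve " + entry + " -> " + target);
        }
        return target;
    }

    /**
     * Resolves every entry. With ignoreFailures == false, the first dangling
     * symlink aborts the whole listing. With ignoreFailures == true, dangling
     * symlinks are silently dropped from the result.
     */
    static List<String> list(List<String> entries, Map<String, String> links,
                             List<String> existing, boolean ignoreFailures)
            throws IOException {
        List<String> out = new ArrayList<>();
        for (String e : entries) {
            try {
                out.add(resolve(e, links, existing));
            } catch (IOException ex) {
                if (!ignoreFailures) {
                    throw ex; // fail the whole operation
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<String> entries = Arrays.asList("/d/a", "/d/dangling");
        Map<String, String> links =
            Collections.singletonMap("/d/dangling", "/d/missing");
        List<String> existing = Collections.singletonList("/d/a");

        System.out.println("lenient: " + list(entries, links, existing, true));
        try {
            list(entries, links, existing, false);
        } catch (IOException ex) {
            System.out.println("strict: " + ex.getMessage());
        }
    }
}
```

The strict branch corresponds to what I see in the patch today; the lenient 
branch is only my reading of what the earlier discussion proposed.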

Nice job updating JavaDocs to describe the effects of symlinks on existing 
methods.  I'm going to take one more pass over this part, just to make sure we 
covered everything.

Regarding the backwards-incompatible change of abstract listStatus to abstract 
listLinkStatus, I also do not see a way to avoid this.  At least this way, it's 
only incompatible for subclass implementers and not callers.
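
A minimal sketch of why callers are unaffected, under the assumption that the 
base class keeps a concrete {{listStatus}} built on top of the new abstract 
method.  The class names, the string-based paths, and the {{resolve}} hook are 
hypothetical and much simpler than the real {{FileSystem}} signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListStatusSketch {

    /** Hypothetical stand-in for the base class; not the real FileSystem. */
    abstract static class SketchFileSystem {
        /** Implementers now override this: raw listing, symlinks included. */
        abstract List<String> listLinkStatus(String dir);

        /** Implementers say where a path points; non-links map to themselves. */
        abstract String resolve(String path);

        /** Callers keep calling listStatus, which resolves each raw entry. */
        List<String> listStatus(String dir) {
            List<String> out = new ArrayList<>();
            for (String p : listLinkStatus(dir)) {
                out.add(resolve(p));
            }
            return out;
        }
    }

    /** Toy implementation: every directory holds "file" and "link" -> "file". */
    static final class InMemoryFs extends SketchFileSystem {
        @Override
        List<String> listLinkStatus(String dir) {
            return Arrays.asList(dir + "/file", dir + "/link");
        }

        @Override
        String resolve(String path) {
            if (path.endsWith("/link")) {
                return path.substring(0, path.length() - "link".length()) + "file";
            }
            return path;
        }
    }

    public static void main(String[] args) {
        SketchFileSystem fs = new InMemoryFs();
        System.out.println("raw:      " + fs.listLinkStatus("/d"));
        System.out.println("resolved: " + fs.listStatus("/d"));
    }
}
```

A subclass written against the old abstract {{listStatus}} would have to be 
updated to implement the new method, while code that only calls 
{{listStatus}} compiles unchanged.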

Methods that perform auto-resolution will return multiple occurrences of the 
same path if there are multiple symlinks with the same target.  I haven't seen 
this mentioned explicitly in the prior threads discussing compatibility 
concerns, so I thought I'd bring it up.  This decision can be significant for 
apps.  Taking the example of MapReduce running against HDFS, 
{{FileInputFormat#getSplits}} runs {{FileSystem#globStatus}} and skips symlinks 
(based on a length != 0 check).  If {{FileSystem#globStatus}} returns symlinks, 
they don't go into the job input.  If the symlinks are auto-resolved (as in 
this patch), then the same HDFS blocks get used multiple times to create 
multiple input splits.  According to comments in HADOOP-9912, {{globStatus}} 
has been inconsistent over time, and I think auto-resolving yields the correct 
expected behavior anyway.  I have no objection to the change, but I wanted to 
describe it clearly in case anyone else has concerns.
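
For anyone who wants to see the effect in isolation, here is a small model of 
it.  This is not the MapReduce code: {{Entry}}, {{resolveAll}}, and 
{{inputPaths}} are hypothetical stand-ins, symlinks are assumed to report 
length 0, and link targets are assumed to be plain files in the same listing.  
The {{distinct}} helper shows one way an app could collapse the duplicates if 
it wanted to; nothing in the patch does this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateTargetsSketch {

    /** Minimal stand-in for a file status. */
    static final class Entry {
        final String path;
        final long length;
        final String target; // null unless this entry is a symlink

        Entry(String path, long length, String target) {
            this.path = path;
            this.length = length;
            this.target = target;
        }
    }

    static Entry find(List<Entry> listing, String path) {
        for (Entry e : listing) {
            if (e.path.equals(path)) {
                return e;
            }
        }
        throw new IllegalArgumentException("no such path: " + path);
    }

    /** Mimics auto-resolution: each symlink is replaced by its target's entry. */
    static List<Entry> resolveAll(List<Entry> listing) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : listing) {
            out.add(e.target == null ? e : find(listing, e.target));
        }
        return out;
    }

    /** Mimics the length != 0 filter: paths that would become job input. */
    static List<String> inputPaths(List<Entry> statuses) {
        List<String> out = new ArrayList<>();
        for (Entry e : statuses) {
            if (e.length != 0) {
                out.add(e.path);
            }
        }
        return out;
    }

    /** One possible app-side remedy: keep the first occurrence of each path. */
    static Set<String> distinct(List<String> paths) {
        return new LinkedHashSet<>(paths);
    }

    /** One real file plus two symlinks that both point at it. */
    static List<Entry> sampleListing() {
        return Arrays.asList(
            new Entry("/data/part-0", 128L, null),
            new Entry("/data/link-a", 0L, "/data/part-0"),
            new Entry("/data/link-b", 0L, "/data/part-0"));
    }

    public static void main(String[] args) {
        List<Entry> listing = sampleListing();
        List<String> unresolved = inputPaths(listing);
        List<String> resolved = inputPaths(resolveAll(listing));
        System.out.println("inputs without resolution: " + unresolved);
        System.out.println("inputs with resolution:    " + resolved);
        System.out.println("distinct resolved inputs:  " + distinct(resolved));
    }
}
```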


> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by 
> default
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Blocker
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, 
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch, 
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that 
> many existing HDFS clients would be broken by listStatus and globStatus 
> returning symlinks.  One example is applications that assume that 
> !FileStatus#isFile implies that the inode is a directory.  As we discussed in 
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning 
> resolved paths.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
