[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-24 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776512#comment-13776512
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

bq. I mean listStatus(Path, PathOption) should call into listLinkStatus(it is 
HDFS::listStatus which is a primitive RPC call), not the other way around. I 
wonder how can we implement listStatus(Path, PathOption) without the primitive 
of listLinkStatus(Path)?

FileSystem#listStatus(Path, PathOption) should just be an abstract function 
which is implemented by DistributedFileSystem and the other implementation 
classes.  DistributedFileSystem, and the other FileSystem implementations, need 
access to the other things in PathOption, such as the error handler.  Also, if 
we want to add more options in the future, we don't want to create 
listLinkStatusWithFoo and listLinkStatusWithFooAndBar.  Just listStatus(Path, 
PathOption).
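
As a rough sketch of the point about options, assuming a hypothetical shape for PathOption: the names SketchPathOptions, SketchFs, InMemorySketchFs and the resolveSymlinks flag are illustrative only, not a proposed signature, and String stands in for Path so the example is self-contained.  A new option becomes a new field on the options object rather than a new listLinkStatusWithFoo method.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the proposed PathOption type; not Hadoop API.
final class SketchPathOptions {
    final boolean resolveSymlinks;   // assumed option: follow links or not
    final Predicate<String> filter;  // plays the role of a PathFilter

    private SketchPathOptions(boolean resolveSymlinks, Predicate<String> filter) {
        this.resolveSymlinks = resolveSymlinks;
        this.filter = filter;
    }

    static SketchPathOptions defaults() {
        return new SketchPathOptions(true, p -> true);
    }

    SketchPathOptions withResolveSymlinks(boolean v) {
        return new SketchPathOptions(v, filter);
    }

    SketchPathOptions withFilter(Predicate<String> f) {
        return new SketchPathOptions(resolveSymlinks, f);
    }
}

// One abstract entry point; a future option adds a field above, not a method here.
abstract class SketchFs {
    abstract List<String> listStatus(String dir, SketchPathOptions options);
}

// Toy implementation: maps each entry name to its link target (null if not a link).
final class InMemorySketchFs extends SketchFs {
    private final Map<String, String> linkTargets;

    InMemorySketchFs(Map<String, String> linkTargets) {
        this.linkTargets = new LinkedHashMap<>(linkTargets);
    }

    @Override
    List<String> listStatus(String dir, SketchPathOptions options) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : linkTargets.entrySet()) {
            if (!e.getKey().startsWith(dir + "/")) {
                continue; // not under the requested directory
            }
            // The implementation sees the whole options object, not a fixed argument list.
            boolean follow = options.resolveSymlinks && e.getValue() != null;
            String shown = follow ? e.getValue() : e.getKey();
            if (options.filter.test(shown)) {
                out.add(shown);
            }
        }
        return out;
    }
}
```

A caller would write something like fs.listStatus("/d", SketchPathOptions.defaults().withResolveSymlinks(false)) to see links as links.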

I understand that bash globs ignore errors.  But that's not really a good 
reason for us to do the same.  Hadoop and HDFS exist in an environment where 
networks are unreliable.  So if globStatus swallows unresolved symlink errors, 
you could find yourself in a situation where your cross-filesystem symlink 
fails, and you silently operate on data that isn't what you think you're 
operating on.  There are also compatibility reasons not to ignore errors: 
errors were not ignored in branch-1.  We discussed this on HADOOP-9929.

 new APIs for listStatus and globStatus to deal with symlinks
 

             Key: HADOOP-9972
             URL: https://issues.apache.org/jira/browse/HADOOP-9972
         Project: Hadoop Common
      Issue Type: Improvement
      Components: fs
Affects Versions: 2.1.1-beta
        Reporter: Colin Patrick McCabe
        Assignee: Colin Patrick McCabe

 Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to 
 deal with symlinks.  The issue is that code has been written which is 
 incompatible with the existence of things which are not files or directories. 
 For example, there is a lot of code out there that looks at 
 FileStatus#isFile, and if it returns false, assumes that what it is looking 
 at is a directory.  In the case of a symlink, this assumption is incorrect.

 It seems reasonable to make the default behavior of {{FileSystem#listStatus}} 
 and {{FileSystem#globStatus}} be fully resolving symlinks, and ignoring 
 dangling ones.  This will prevent incompatibility with existing MR jobs and 
 other HDFS users.  We should also add new versions of listStatus and 
 globStatus that allow new, symlink-aware code to deal with symlinks as 
 symlinks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-24 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776870#comment-13776870
 ] 

Binglin Chang commented on HADOOP-9972:
---

bq. Also, if we want to add more options in the future, we don't want to create 
listLinkStatusWithFoo and listLinkStatusWithFooAndBar. Just listStatus(Path, 
PathOption).
That is exactly why I propose that listStatus(Path, PathOption) be implemented 
in FileSystem using the more primitive listLinkStatus(Path), so if we add an 
option, we don't end up modifying the code of every FileSystem subclass. 
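
A minimal sketch of that layering, under stated assumptions: SketchStatus, LayeredFs and ToyLayeredFs are hypothetical stand-ins (not Hadoop classes), String stands in for Path, and a single boolean stands in for whatever PathOption would carry.  Dangling links and error handling are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FileStatus; linkTarget is null unless this is a symlink.
final class SketchStatus {
    final String path;
    final String linkTarget;

    SketchStatus(String path, String linkTarget) {
        this.path = path;
        this.linkTarget = linkTarget;
    }

    boolean isSymlink() {
        return linkTarget != null;
    }
}

abstract class LayeredFs {
    // Primitive: list a directory without resolving links.
    abstract List<SketchStatus> listLinkStatus(String dir);

    // Primitive: status of a path with links resolved.
    abstract SketchStatus getFileStatus(String path);

    // Written once in the base class, on top of the two primitives.
    final List<SketchStatus> listStatus(String dir, boolean resolveSymlinks) {
        List<SketchStatus> out = new ArrayList<>();
        for (SketchStatus s : listLinkStatus(dir)) {
            out.add(resolveSymlinks && s.isSymlink() ? getFileStatus(s.linkTarget) : s);
        }
        return out;
    }
}

// Toy subclass: it only supplies the primitives.
final class ToyLayeredFs extends LayeredFs {
    private final Map<String, List<SketchStatus>> dirs = new HashMap<>();

    void put(String dir, List<SketchStatus> entries) {
        dirs.put(dir, entries);
    }

    @Override
    List<SketchStatus> listLinkStatus(String dir) {
        return dirs.getOrDefault(dir, new ArrayList<>());
    }

    @Override
    SketchStatus getFileStatus(String path) {
        // Pretend every link target exists and is not itself a link.
        return new SketchStatus(path, null);
    }
}
```

With this shape, adding an option changes only LayeredFs#listStatus; the subclasses keep implementing the same two primitives.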

bq. we don't want to create listLinkStatusWithFoo and 
listLinkStatusWithFooAndBar. Just listStatus(Path, PathOption).
I am not against the listStatus(Path, PathOption) API, just its implementation 
detail; this issue can be solved by listStatus(Path, PathOption). 

bq. Hadoop and HDFS exist in an environment where there are unreliable networks.
I don't think we should ignore all errors, including network issues. It is 
like disk failures or temporarily unreadable files in Linux: globbing can't 
ignore those either. In that case the error should just be passed all the way 
up to the user; most users don't want to handle this error in an ErrorHandler 
either.

bq. So if globStatus swallows unresolved symlink errors.
Are you saying that a network issue can cause an unresolved symlink error? If 
dead-link errors are already mixed up with network errors, then together with 
the compatibility reasons, I agree with you: we can't follow Linux practice.





[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-24 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777020#comment-13777020
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

I thought about this for a while, and I think adding a {{listLinkStatus}} 
function as you suggest might be a good idea.  It's kind of similar to 
{{getFileLinkStatus}}.  The nice thing about this approach is that filesystems 
which don't yet support symlinks can go back to the default approach of 
resolving all links (the same as getFileStatus).

bq. Are you saying network issue can cause unresolved symlink error?

Yes.  Symlinks can be cross-filesystem, and if one filesystem is unreachable, 
that would be a network error.

For globStatus, I'd like to do something similar to {{FileContext#create}}, 
where you have a varargs argument with {{CreateOpts}}.

One nice thing is that {{FileSystem#globStatus}} is not implemented by 
subclasses like {{listStatus}} is.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773255#comment-13773255
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

Hmm.  We could have a convenience method called {{listLinkStatus}} which just 
calls into {{listStatus}} with the correct {{PathOptions}}.  I sort of lean 
towards fewer APIs rather than more, but maybe it makes sense.

Shell globbing doesn't ignore all errors, btw.
{code}
cmccabe@keter:~/mydir mkdir a
cmccabe@keter:~/mydir mkdir b
cmccabe@keter:~/mydir touch a/c
cmccabe@keter:~/mydir touch b/c
cmccabe@keter:~/mydir sudo chmod 000 b
root's password:
cmccabe@keter:~/mydir ls */c
a/c
cmccabe@keter:~/mydir ls b/c
ls: cannot access b/c: Permission denied
cmccabe@keter:~/mydir ls
a  b
cmccabe@keter:~/mydir ls *
a:
c
ls: cannot open directory b: Permission denied
{code}

It's interesting that it ignores the error in the case of {{ls */c}}, but not 
in the case of {{ls *}}.  Hadoop's shell would abort (and give no results) in 
both of those cases, which I think is suboptimal.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773417#comment-13773417
 ] 

Jason Lowe commented on HADOOP-9972:


+1 to the idea of having a new API where symlink resolution and a per-entry 
error handler can be specified.  That should allow the client to cover all 
three scenarios based on how the handler reacts to errors.  Just to be clear, 
what happens if the error handler does not rethrow the exception?  Is the entry 
removed from the listStatus results, returned as a raw symlink, or ...?  Is it 
controllable by the error handler?  I'm not sure whether the difference between 
"log the exception and continue" and "ignore it completely" is a different 
return code from the error handler method or just whether the handler logs or 
not.

bq. At first glance, I like extending the PathFilters.

That's a twist on the approach; I'm not sure that's been proposed.  I suppose 
one could derive a new interface from PathFilter that becomes PathOptions, and 
listStatus(Path, PathFilter) could check internally whether it's actually got a 
PathOptions instead of a PathFilter and behave differently.  However, I think 
an explicit, separate API would be preferable, simply for clarity of what the 
API expects from callers.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773471#comment-13773471
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

bq.Just to be clear, what happens if the error handler does not rethrow the 
exception?

If the error handler doesn't rethrow the exception, the listStatus / globStatus 
operation continues normally and returns the remaining results.  (We can't 
return the result that had the error.)  Unresolved symlinks are one type of 
error.  Whether to handle {{UnresolvedLinkException}} differently than other 
exceptions is up to the {{PathErrorHandler}} you provide.
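
A minimal sketch of those semantics, assuming a hypothetical shape for {{PathErrorHandler}}: SketchPathErrorHandler, ListingSketch, Resolver and handleError are illustrative names, not existing Hadoop API, and String stands in for both Path and FileStatus.  If the handler returns normally, the failed entry is dropped and the listing continues; if it rethrows, the whole operation fails.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape for the proposed PathErrorHandler; not existing Hadoop API.
interface SketchPathErrorHandler {
    // Rethrow to abort the listing; return normally to skip the failed entry.
    void handleError(String path, IOException e) throws IOException;
}

final class ListingSketch {
    // Stand-in for per-entry work (for example, resolving a symlink) that may fail.
    interface Resolver {
        String resolve(String path) throws IOException;
    }

    static List<String> listStatus(List<String> entries, Resolver resolver,
                                   SketchPathErrorHandler handler) throws IOException {
        List<String> out = new ArrayList<>();
        for (String p : entries) {
            try {
                out.add(resolver.resolve(p));
            } catch (IOException e) {
                // The handler decides: it may log, ignore, or rethrow.
                handler.handleError(p, e);
                // Handler returned: this entry is left out, the rest still get listed.
            }
        }
        return out;
    }
}
```

A handler of (path, e) -> { } ignores every error, while (path, e) -> { throw e; } gives abort-on-first-error; a handler could also check the exception type to treat unresolved links differently from other failures.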

bq. I'm not sure if the difference between log exception and continue vs. 
ignore it completely is a different return code from the error handler method 
or just whether the handler logs or not.

I was proposing that the logging happen inside the {{PathErrorHandler}}.  From 
the point of view of FileSystem / FileContext, all we care about is whether the 
{{PathErrorHandler}} rethrows the exception or not.  (We can provide a class 
implementing PathErrorHandler that logs to FileSystem#LOG if that is a common 
use case.)

bq.  I suppose one could derive a new interface from PathFilter that becomes 
PathOptions and listStatus(Path, PathFilter) could check internally if it's 
actually got a PathOption instead of a PathFilter and behave differently. 
However I think an explicit, separate API would be preferable though, simply 
for clarity of what the API expects from callers.

Yeah, I was proposing adding a new type, {{PathOptions}}, which could contain 
an instance of {{PathFilter}}.  We could add new methods to {{PathFilter}}, but 
since it's a public/stable interface rather than an abstract class, that would 
be an incompatible change.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-20 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773584#comment-13773584
 ] 

Binglin Chang commented on HADOOP-9972:
---

bq. Hmm. We could have a convenience method called listLinkStatus which just 
called into listStatus with the correct PathOptions. I sort of lean towards 
fewer APIs rather than more, but maybe it makes sense.
I mean listStatus(Path, PathOption) should call into listLinkStatus (it is 
HDFS::listStatus, which is a primitive RPC call), not the other way around. I 
wonder how we can implement listStatus(Path, PathOption) without the primitive 
listLinkStatus(Path).

bq. Shell globbing doesn't ignore all errors
What I mean by globbing is just shell wildcard substitution. It does indeed 
ignore all errors; glob just replaces a string containing wildcards with the 
matching strings. 
http://www.linuxjournal.com/content/bash-extended-globbing
http://tldp.org/LDP/abs/html/globbingref.html
{code}
drwxr-xr-x  2 decster  staff  68 Sep 19 17:09 aa
drwxr-xr-x  2 decster  staff  68 Sep 19 17:12 bb
decster:~/projects/test echo *
aa bb
decster:~/projects/test echo */cc
*/cc
{code}

In your example:

{code}
cmccabe@keter:~/mydir ls b/c
ls: cannot access b/c: Permission denied
# this error is thrown by ls, not globbing

cmccabe@keter:~/mydir ls *
a:
c
ls: cannot open directory b: Permission denied
# ls * first becomes ls a b
# then ls throws the error when it processes b
{code}
 



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-19 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772074#comment-13772074
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

I guess I should talk about the motivation here.  Daryn Sharp, Kihwal Lee, 
Nathan Roberts, Eli Collins, Andrew Wang, and I had a discussion about the 
new symlinks support in FileSystem in Hadoop 2.  The Yahoo! guys were concerned 
that if listStatus started returning symlinks, a lot of user code would break.  
One example is code that assumes that if FileStatus#isFile is false, then the 
inode is a directory.  Obviously, that's false in the case of symlinks.
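
To make the breakage concrete, here is a small sketch of the assumption in question.  EntryKind and KindedStatus are hypothetical stand-ins for FileStatus (which does have isFile, isDirectory and isSymlink); the two classify methods are illustrative user code, not anything in Hadoop.

```java
// Hypothetical stand-in for the three kinds of entry a listing may return.
enum EntryKind { FILE, DIRECTORY, SYMLINK }

final class KindedStatus {
    final EntryKind kind;

    KindedStatus(EntryKind kind) {
        this.kind = kind;
    }

    boolean isFile() {
        return kind == EntryKind.FILE;
    }

    boolean isDirectory() {
        return kind == EntryKind.DIRECTORY;
    }

    boolean isSymlink() {
        return kind == EntryKind.SYMLINK;
    }

    // Pre-symlink style of user code: "not a file" is taken to mean "directory".
    static String classifyLegacy(KindedStatus s) {
        return s.isFile() ? "file" : "directory";
    }

    // Symlink-aware version: checks each case explicitly.
    static String classify(KindedStatus s) {
        if (s.isSymlink()) {
            return "symlink";
        }
        return s.isDirectory() ? "directory" : "file";
    }
}
```

Given a symlink entry, classifyLegacy reports "directory", which is the kind of silent misbehavior we were worried about.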

To prevent this scenario, we want to change FileSystem#listStatus and 
FileSystem#globStatus to resolve all symlinks, and then provide an extended API 
for users who don't want that auto-resolve behavior.  That's what this 
discussion is about-- what that extended API should look like.

The discussion about whether HDFS should replace listStatus with something more 
like POSIX readdir seems like a tangent.  That's an interesting thing to 
discuss, but it doesn't really solve our problem in branch-2.1-beta, since 
there is still going to be code around that calls listStatus and globStatus for 
a long, long time.

This is a tangent, but I'm not even convinced that we should replace 
{{listStatus}} with {{readdir}}.  The reason why {{listStatus}} returns 
{{FileStatus[]}} rather than just a list of paths and file types is to minimize 
the number of network round trips to the NameNode.  That is still something we 
care about.  If you run {{/bin/ls}} with strace, you'll see that ls calls 
{{getdents}} (the system call underlying readdir) and then makes an {{lstat}} 
call on each path name in the directory.  If the HDFS shell did the same thing, 
it would dramatically increase the number of RPCs it makes to the NameNode.

Also see Jason Lowe's comment here: 
https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-19 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772566#comment-13772566
 ] 

Binglin Chang commented on HADOOP-9972:
---

There are two issues we are talking about. One is the new API:

bq. The discussion about whether HDFS should replace listStatus with something 
more like POSIX readdir seems like a tangent.
I think there is some confusion here: I didn't propose using POSIX readdir. The 
API name readdir was probably causing the confusion, so I changed it to 
listLinkStatus instead. Its semantics are the same as the current HDFS 
listStatus, which doesn't resolve links.

bq. To prevent this scenario, we want to change FileStatus#listStatus and 
FileStatus#globStatus to resolve all symlinks
I'm fully aware of this, and my proposal does not break it.

Frankly, I don't see any conflict between the two proposals. In order to 
implement listStatus(Path, PathOption), a listLinkStatus (or something with the 
same semantics) primitive/core API is required, and it is mostly there (in 
HDFS; other filesystems don't support symlinks, except LocalFS). Since there is 
no conflict from my side, I think you can just submit the patch or give the 
implementation details of listStatus(Path, PathOption) first. 

The other issue is that globbing doesn't follow Linux practice:
It is probably a tangent; it was brought up only because of the example about 
the usage of PathErrorHandler. I said that Linux shell globbing ignores all 
errors, and the example can be solved by following Linux practice. If we decide 
not to follow Linux practice and solve it another way, that is OK, although I 
prefer Linux practice.





[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-19 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772570#comment-13772570
 ] 

Binglin Chang commented on HADOOP-9972:
---

You are probably confused by my earlier comments. I did not mean that 
listLinkStatus only returns filename and type. 

bq. Most linux/bsd system, readdir return filename and type.
I meant Linux readdir in that comment, not the core API listLinkStatus. 




[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770471#comment-13770471
 ] 

Binglin Chang commented on HADOOP-9972:
---

Hi Colin, 
About the globStatus example: if we follow Linux practice, globStatus(pattern) = 
glob(pattern).map(path => getFileStatus(path))
String[] glob(pattern):
  if it matches none, return pattern
  else return matched paths
  ignore all exceptions

I did some experiments; you can see that ls * does indeed show an error 
message, but ls */stuff does not.
{code}
[root@master01 test]# mkdir -p aa/cc/foo
[root@master01 test]# mkdir -p bb/cc/foo
[root@master01 test]# chmod 700 bb
[root@master01 test]# ll /home/serengeti/.bash
[root@master01 test]# su serengeti
[serengeti@master01 test]$ ll
total 8
drwxr-xr-x 3 root root 4096 Sep 18 08:30 aa
drwx------ 3 root root 4096 Sep 18 08:31 bb
[serengeti@master01 test]$ ls *
aa:
cc
ls: bb: Permission denied
[serengeti@master01 test]$ ls */cc
foo
{code}

Separating globStatus into glob and getFileStatus seems a more proper way of 
doing globStatus than adding new classes/interfaces and a callback handler. 
This is Linux practice, and should be more robust.
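
A rough sketch of the glob half of that split, to show the "matches none, return pattern" rule.  GlobSketch is a hypothetical name, the candidate names are passed in as a plain list instead of being read from a filesystem, and the JDK's own glob matcher is used only as a stand-in for Hadoop's pattern handling; the "ignore all exceptions" part is not modelled.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class GlobSketch {
    // Shell-style expansion over a fixed list of candidate names:
    // return the matches, or the pattern itself when nothing matches.
    static List<String> glob(String pattern, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> out = new ArrayList<>();
        for (String c : candidates) {
            if (matcher.matches(Paths.get(c))) {
                out.add(c);
            }
        }
        return out.isEmpty() ? Collections.singletonList(pattern) : out;
    }
}
```

globStatus would then be a second step that calls getFileStatus on each returned string, and that step is where any error would surface.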









[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770496#comment-13770496
 ] 

Binglin Chang commented on HADOOP-9972:
---

Regarding the API, I think we should differentiate between core APIs and 
extended/legacy APIs. IMO, there should be 3 core APIs:

getFileStatus      resolves symlinks
getFileLinkStatus  doesn't resolve symlinks
readdir            doesn't resolve symlinks, just like the current HDFS listStatus

These core APIs should be implemented in each FS.

All other related APIs can be built on the core APIs and implemented in 
FileContext/FileSystem once and for all:
{code}
FS.listStatus(path):
  readdir(path).map(s => if (s.isSymlink) getFileStatus ignore Exception else s)

FS.listStatus(path, PathOptions):
  readdir(path).map(process PathOptions)

glob(pattern):
  if pattern matches none, return pattern
  else return matched paths
  ignore all exceptions

globStatus(pattern):
  glob(pattern).map(getFileStatus)
{code}
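
One possible reading of the layering above, as a rough Java sketch. Everything here is hypothetical: {{Status}} stands in for {{FileStatus}}, the namespace is a toy in-memory map, and "ignore Exception" is interpreted as dropping dangling links from the result.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LayeredListing {
    // Hypothetical stand-in for FileStatus: a path plus an optional symlink target.
    static final class Status {
        final String path;
        final String linkTarget; // null unless this entry is a symlink

        Status(String path, String linkTarget) {
            this.path = path;
            this.linkTarget = linkTarget;
        }

        boolean isSymlink() { return linkTarget != null; }
    }

    // Toy in-memory namespace standing in for a real file system.
    private final Map<String, Status> entries = new LinkedHashMap<>();

    void add(String path, String linkTarget) {
        entries.put(path, new Status(path, linkTarget));
    }

    // Core primitive: direct children of dir, symlinks returned as symlinks.
    List<Status> listLinkStatus(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<Status> out = new ArrayList<>();
        for (Status s : entries.values()) {
            boolean isChild = s.path.startsWith(prefix)
                    && s.path.indexOf('/', prefix.length()) < 0;
            if (isChild) out.add(s);
        }
        return out;
    }

    // Core primitive: follow symlink chains; fail if the chain dangles or loops.
    Status getFileStatus(String path) throws IOException {
        Status s = entries.get(path);
        for (int hops = 0; s != null && s.isSymlink(); hops++) {
            if (hops >= 32) throw new IOException("too many levels of symlinks: " + path);
            s = entries.get(s.linkTarget);
        }
        if (s == null) throw new FileNotFoundException(path);
        return s;
    }

    // Derived API: built once on top of the two primitives.
    // Symlinks are replaced by their target's status; dangling ones are skipped.
    List<Status> listStatus(String dir) {
        List<Status> out = new ArrayList<>();
        for (Status s : listLinkStatus(dir)) {
            if (!s.isSymlink()) {
                out.add(s);
                continue;
            }
            try {
                out.add(getFileStatus(s.path));
            } catch (IOException ignored) {
                // Dangling or looping link: left out of the listing.
            }
        }
        return out;
    }
}
```

Note that the derived {{listStatus}} costs one extra lookup per symlink, which is part of what the efficiency discussion in this thread is about.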




[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770532#comment-13770532
 ] 

Binglin Chang commented on HADOOP-9972:
---

Perhaps listLinkStatus is a better name for readdir.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771492#comment-13771492
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

bq. I did some experiments, you can see ls * indeed shows an error message, but 
ls */stuff should not show an error message.

I'm afraid that what you're seeing is a bug.  I introduced this bug and I have 
a patch available to fix it: https://issues.apache.org/jira/browse/HADOOP-9929

This bug is also not in branch-2.1-beta, so if you'd like to see what the 
current correct behavior of globStatus is, try that branch.  You can also try 
branch-1.

bq. [listLinkStatus proposal]

I want to avoid a combinatorial explosion of function overloads.

Right now we have {{FileSystem#listStatus(Path)}}, 
{{FileSystem#listStatus(Path, PathFilter)}}, {{FileSystem#listStatus(Path[])}}, 
and {{FileSystem#listStatus(Path[], PathFilter filter)}}.  If we create 
{{listLinkStatus}} as you proposed, that multiplies the number of functions in 
FileSystem by 2x, since we have to create a {{listLinkStatus}} equivalent for 
each of these.

It's much cleaner to fold the {{PathFilter}} into a {{PathOptions}} class, I 
think.  That only requires adding two new functions to FileSystem:  
{{FileSystem#listStatus(Path, PathOptions)}} and 
{{FileSystem#listStatus(Path[], PathOptions)}}.
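
As a rough illustration of why folding the filter into an options object keeps the overload count flat, here is a minimal sketch. The names {{ListOptions}} and {{SketchFileSystem}} are hypothetical, paths are plain strings, and {{Predicate<String>}} stands in for {{PathFilter}}. Deriving the array variant from the single-path one is an assumption of this sketch, not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical options bag; new knobs become fields here, not new overloads.
final class ListOptions {
    Predicate<String> filter = p -> true; // stand-in for PathFilter
    boolean resolveSymlinks = true;

    ListOptions filter(Predicate<String> f) {
        this.filter = f;
        return this;
    }

    ListOptions resolveSymlinks(boolean resolve) {
        this.resolveSymlinks = resolve;
        return this;
    }
}

abstract class SketchFileSystem {
    // The one method each implementation has to provide.
    abstract List<String> listStatus(String path, ListOptions options);

    // Existing overloads are written once, in terms of the options-based method.
    final List<String> listStatus(String path) {
        return listStatus(path, new ListOptions());
    }

    final List<String> listStatus(String path, Predicate<String> filter) {
        return listStatus(path, new ListOptions().filter(filter));
    }

    final List<String> listStatus(String[] paths, ListOptions options) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            out.addAll(listStatus(p, options));
        }
        return out;
    }
}
```

A caller that wants symlinks reported as symlinks would pass {{new ListOptions().resolveSymlinks(false)}} instead of calling a separately named method.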

With regard to {{globStatus}}, you can't build what we want on top of what we 
have now.  The first IOException we hit will cause the globStatus function to 
abort.  Clients like the shell, which want to handle errors differently, simply 
don't get a chance to do so with the current API.

bq. Separate globStatus to glob and getFileStatus seems a more proper way of 
doing globStatus rather than add new classes/interface and callback handler, 
and this is linux practice, should be more robust

The Linux practice is based on the fact that {{readdir}} only returns path 
names (i.e. strings) in POSIX.  In HDFS and other Hadoop filesystems, we don't 
have a name-only {{readdir}}; directory listings ({{listStatus}}) return 
{{FileStatus}} objects, just as {{getFileStatus}} and {{getFileLinkStatus}} do.

Since we're already dealing with {{FileStatus}} objects, it makes no sense to 
call {{getFileStatus}} on them again-- it's a pure waste of computer time.  You 
also need some way of handling errors encountered in globStatus besides 
ignoring them or aborting the whole glob.  See HADOOP-9929 for more commentary 
on this issue.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771563#comment-13771563
 ] 

Binglin Chang commented on HADOOP-9972:
---

bq. I'm afraid that what you're seeing is a bug. I introduced this bug and I 
have a patch available to fix it: 
https://issues.apache.org/jira/browse/HADOOP-9929
The experiment was done on Linux. I was talking about Linux practice: glob 
ignores all permission/dangling errors, and ls then handles the errors 
properly. We had better follow Linux practice; I don't see how it is related 
to HADOOP-9929.
I think the correct fix for HADOOP-9929 should be:
{code}
hadoop fs -ls /user/abc/tests/data
  glob(/user/abc/tests/data)
    pattern matches nothing because of permission issue, so just return [/user/abc/tests/data]
  ls [/user/abc/tests/data] return permission error
{code}

bq. I want to avoid a combinatorial explosion of function overloads.
There is no combinatorial explosion. Every FS already has a listStatus 
implementation. If the FS supports symlinks (to my knowledge only LocalFS and 
HDFS do), we add listLinkStatus (for HDFS, just rename listStatus to 
listLinkStatus); if the FS does not support symlinks, by default listStatus = 
listLinkStatus. The change is minimal. All the other non-core APIs 
(listStatus(Path), listStatus(Path, PathFilter), listStatus(Path[]), 
listStatus(Path[], PathFilter), listStatus(Path, PathOption)) should be 
implemented only in FS/FC.

listStatus(Path, PathOption) doesn't look like a core API; a core API should 
be minimal, orthogonal, and complete. In the end, listStatus(Path, PathOption) 
still needs a readdir/getLinkStatus equivalent to implement it.

bq. The Linux practice is based on the fact that readdir only returns path 
names (i.e. strings) in POSIX
On most Linux/BSD systems, readdir returns the file name and type.
http://man7.org/linux/man-pages/man3/readdir.3.html










[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771565#comment-13771565
 ] 

Binglin Chang commented on HADOOP-9972:
---

bq. listStatus(Path, PathOption) doesn't like a core API, core API should be 
minimal, orthogonal, and complete. listStatus(Path, PathOption) in the end 
still need readdir/getLinkStatus equivalent to implement.
Sorry, getLinkStatus should be listLinkStatus.

Another way of saying this is that readdir/listLinkStatus is already 
there (LocalFS needs some changes, HDFS already has listStatus, and for an FS 
that does not support symlinks listStatus == listLinkStatus), but for 
compatibility reasons we can't use the name listStatus anymore, so we just 
use another name.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-18 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771569#comment-13771569
 ] 

Binglin Chang commented on HADOOP-9972:
---

bq. Since we're already dealing with FileStatus objects, it makes no sense to 
call getFileStatus on them again-- it's a pure waste of computer time.
It is unfortunate that we don't have a readdir that only returns the inode 
name and type, but that is the way shell globbing works. Correctness comes 
before efficiency; we can combine the two steps as an optimization as long as 
the result is correct.




[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769646#comment-13769646
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

I think we can probably let {{FileContext#listStatus}} and 
{{FileContext#Util#globStatus}} default to *not* fully resolving symlinks.  
This makes sense, since {{FileContext}} has had symlink support for a long 
time, and doesn't have as much legacy code relying on it.

We also probably need some way of sensibly handling errors in globStatus.  
Right now, we really only have the choice of ignoring the error or throwing 
an exception which ends the whole globStatus.  We should add some options.



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769653#comment-13769653
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

Proposed new APIs (in FileSystem and FileContext):
{code}
FileStatus[] listStatus(Path path, PathOptions options) throws IOException;
FileStatus[] globStatus(Path path, PathOptions options) throws IOException;
{code}

The {{PathOptions}} class will contain three fields:
{code}
  private PathFilter pathFilter;
  private PathErrorHandler errorHandler;
  private Boolean resolveSymlinks;
{code}

{{PathFilter}} serves the same purpose that it currently does-- filtering out 
paths from the results.

{{PathErrorHandler}} has a {{handleError}} function taking a {{Path}} and 
{{IOException}}.  This function gets invoked whenever there is an IOException.  
It can choose to rethrow the exception, log the exception and continue, or 
simply ignore it completely.

{{resolveSymlinks}} determines whether we should fully resolve all symlinks 
that we come across.  If it is set, we will never get back a FileStatus for a 
symlink from either {{listStatus}} or {{globStatus}}.

We can add more fields to {{PathOptions}} later if it becomes necessary.
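
A minimal sketch of how these two types might look, under several assumptions: paths are plain strings rather than Hadoop {{Path}} objects, {{Predicate<String>}} stands in for {{PathFilter}}, and the setter/getter names, the three ready-made handlers, and the default values are invented here for illustration.

```java
import java.io.IOException;
import java.util.function.Predicate;

// Hypothetical shape of the proposed handler interface.
interface PathErrorHandler {
    void handleError(String path, IOException e) throws IOException;

    // The three behaviors mentioned in the proposal, as ready-made handlers.
    PathErrorHandler RETHROW = (path, e) -> { throw e; };
    PathErrorHandler IGNORE = (path, e) -> { };
    PathErrorHandler LOG_AND_CONTINUE =
            (path, e) -> System.err.println(path + ": " + e.getMessage());
}

// Hypothetical options bag holding the three proposed fields.
final class PathOptions {
    private Predicate<String> pathFilter = p -> true;                 // accept everything
    private PathErrorHandler errorHandler = PathErrorHandler.RETHROW; // assumed default: abort
    private boolean resolveSymlinks = true;                           // assumed default: resolve

    PathOptions setPathFilter(Predicate<String> pathFilter) {
        this.pathFilter = pathFilter;
        return this;
    }

    PathOptions setErrorHandler(PathErrorHandler errorHandler) {
        this.errorHandler = errorHandler;
        return this;
    }

    PathOptions setResolveSymlinks(boolean resolveSymlinks) {
        this.resolveSymlinks = resolveSymlinks;
        return this;
    }

    Predicate<String> getPathFilter() { return pathFilter; }
    PathErrorHandler getErrorHandler() { return errorHandler; }
    boolean getResolveSymlinks() { return resolveSymlinks; }
}
```

A symlink-aware caller could then write something like {{new PathOptions().setResolveSymlinks(false).setErrorHandler(PathErrorHandler.LOG_AND_CONTINUE)}}, and a later field would not change any method signature.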



[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks

2013-09-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769661#comment-13769661
 ] 

Colin Patrick McCabe commented on HADOOP-9972:
--

I guess I should add a few words about why {{PathErrorHandler}} is necessary.  
Basically, we want to give users of {{globStatus}} flexibility.

For example, let's say you have the following directories:
/a owned by superuser, mode 
/b owned by bob, mode 0777

Bob would like to be able to get back a result from {{globStatus(/\*/stuff)}}, 
not just an AccessControlException (which came out of trying to access 
/a/stuff).  But bob also doesn't necessarily want to ignore the 
AccessControlException completely.  He wants something like the behavior of 
GNU ls, which prints an error message to stderr about paths it can't 
access, but still continues to list the remaining paths which it can.  
Currently, bob can't get this-- he simply gets an IOException and *no* 
globStatus results.  Ignoring the error completely seems like the wrong 
thing to do as well, though.  Hence the {{PathErrorHandler}}, which allows more 
sophisticated error handling here.

Symlinks make this more important, since you have errors like 
{{UnresolvedPathException}}, which anyone can cause simply by creating a 
dangling symlink.  We don't want directories with dangling symlinks to become 
un-globbable.  Obviously, the default error handlers will provide the existing 
behavior for {{listStatus}} and {{globStatus}}.
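
Bob's scenario can be sketched roughly as follows. This is a toy, not the proposed implementation: the listing function is hard-coded so that /a fails and /b succeeds, a plain {{IOException}} stands in for {{AccessControlException}}, only the fixed pattern /\*/stuff is expanded, and all class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GlobWithHandler {
    // Hypothetical handler shape: may rethrow, log, or swallow the error.
    interface ErrorHandler {
        void handleError(String path, IOException e) throws IOException;
    }

    // Toy listing: /a is unreadable, /b has one child. Stands in for a real listStatus.
    static List<String> list(String dir) throws IOException {
        if (dir.equals("/a")) throw new IOException("Permission denied: " + dir);
        if (dir.equals("/b")) return Arrays.asList("/b/stuff");
        return new ArrayList<>();
    }

    // Expands the fixed pattern /*/stuff over the given top-level directories.
    static List<String> globStuff(List<String> topLevelDirs, ErrorHandler handler)
            throws IOException {
        List<String> matches = new ArrayList<>();
        for (String dir : topLevelDirs) {
            try {
                for (String child : list(dir)) {
                    if (child.equals(dir + "/stuff")) matches.add(child);
                }
            } catch (IOException e) {
                // The handler decides whether the glob aborts or keeps going.
                handler.handleError(dir, e);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        List<String> dirs = Arrays.asList("/a", "/b");
        // ls-like policy: report the failure on stderr, keep the other results.
        List<String> result =
                globStuff(dirs, (path, e) -> System.err.println("glob: " + e.getMessage()));
        System.out.println(result); // prints [/b/stuff]
    }
}
```

Passing a handler that rethrows reproduces the current abort-on-first-error behavior, so the same code path serves both kinds of caller.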
