[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776512#comment-13776512 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

bq. I mean listStatus(Path, PathOption) should call into listLinkStatus(it is HDFS::listStatus which is a primitive RPC call), not the other way around. I wonder how can we implement listStatus(Path, PathOption) without the primitive of listLinkStatus(Path)?

FileSystem#listStatus(Path, PathOption) should just be an abstract function which is implemented by DistributedFileSystem and the other implementation classes. DistributedFileSystem, and the other FileSystem implementations, need access to the other things in PathOption, such as the error handler. Also, if we want to add more options in the future, we don't want to create listLinkStatusWithFoo and listLinkStatusWithFooAndBar. Just listStatus(Path, PathOption).

I understand that bash globs ignore errors. But that's not really a good reason for us to do the same. Hadoop and HDFS exist in an environment where networks are unreliable. So if globStatus swallows unresolved symlink errors, you could find yourself in a situation where your cross-filesystem symlink fails to resolve, and you silently operate on data that isn't what you think it is.

There are also compatibility reasons not to ignore errors: errors were not ignored in branch-1. We discussed this on HADOOP-9929.

new APIs for listStatus and globStatus to deal with symlinks
------------------------------------------------------------

Key: HADOOP-9972
URL: https://issues.apache.org/jira/browse/HADOOP-9972
Project: Hadoop Common
Issue Type: Improvement
Components: fs
Affects Versions: 2.1.1-beta
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe

Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to deal with symlinks. The issue is that code has been written which is incompatible with the existence of things which are not files or directories. For example, there is a lot of code out there that looks at FileStatus#isFile, and if it returns false, assumes that what it is looking at is a directory. In the case of a symlink, this assumption is incorrect.

It seems reasonable to make the default behavior of {{FileSystem#listStatus}} and {{FileSystem#globStatus}} be fully resolving symlinks, and ignoring dangling ones. This will prevent incompatibility with existing MR jobs and other HDFS users. We should also add new versions of listStatus and globStatus that allow new, symlink-aware code to deal with symlinks as symlinks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
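The pitfall named in the issue description (treating "not a file" as "must be a directory") can be illustrated with a minimal sketch. The {{Kind}}, {{Entry}} and {{IsFilePitfall}} types below are hypothetical stand-ins written for this note, not Hadoop's {{FileStatus}}; only the three type predicates are modeled.

```java
// Hypothetical stand-in for a file status object; not Hadoop's FileStatus.
enum Kind { FILE, DIRECTORY, SYMLINK }

final class Entry {
    final String path;
    final Kind kind;

    Entry(String path, Kind kind) {
        this.path = path;
        this.kind = kind;
    }

    boolean isFile() { return kind == Kind.FILE; }
    boolean isDirectory() { return kind == Kind.DIRECTORY; }
    boolean isSymlink() { return kind == Kind.SYMLINK; }
}

final class IsFilePitfall {
    // Legacy pattern: anything that is not a file is assumed to be a directory.
    static String legacyClassify(Entry e) {
        return e.isFile() ? "file" : "directory";
    }

    // Symlink-aware pattern: every type is checked explicitly.
    static String awareClassify(Entry e) {
        if (e.isFile()) return "file";
        if (e.isDirectory()) return "directory";
        if (e.isSymlink()) return "symlink";
        return "unknown";
    }

    public static void main(String[] args) {
        Entry link = new Entry("/data/current", Kind.SYMLINK);
        System.out.println(legacyClassify(link)); // prints "directory", which is wrong
        System.out.println(awareClassify(link));  // prints "symlink"
    }
}
```

If the default listing resolves symlinks before returning, legacy code of the first kind never sees a symlink entry, which is the compatibility argument made in the description.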
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776870#comment-13776870 ]

Binglin Chang commented on HADOOP-9972:
---------------------------------------

bq. Also, if we want to add more options in the future, we don't want to create listLinkStatusWithFoo and listLinkStatusWithFooAndBar. Just listStatus(Path, PathOption).

That is exactly why I propose that listStatus(Path, PathOption) be implemented in FileSystem on top of the more primitive listLinkStatus(Path): if we add an option, we don't end up modifying the code of every FileSystem subclass.

bq. we don't want to create listLinkStatusWithFoo and listLinkStatusWithFooAndBar. Just listStatus(Path, PathOption).

I am not against the listStatus(Path, PathOption) API, only its implementation detail. This issue can be solved by listStatus(Path, PathOption).

bq. Hadoop and HDFS exist in an environment where there are unreliable networks.

I don't think we should ignore all errors, including network issues. Those are like disk failures or temporarily unreadable files on Linux; globbing can't ignore them either. In that case the error should just be passed all the way up to the user; most users don't want to handle it in an ErrorHandler either.

bq. So if globStatus swallows unresolved symlink errors.

Are you saying a network issue can cause an unresolved symlink error? If dead-link errors are already mixed up with network errors, then, together with the compatibility reasons, I agree with you: we can't follow Linux practice.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777020#comment-13777020 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

I thought about this for a while, and I think adding a {{listLinkStatus}} function as you suggest might be a good idea. It's kind of similar to {{getFileLinkStatus}}. The nice thing about this approach is that filesystems which don't yet support symlinks can fall back to the default approach of resolving all links (the same as {{getFileStatus}}).

bq. Are you saying network issue can cause unresolved symlink error?

Yes. Symlinks can be cross-filesystem, and if one filesystem is unreachable, that would be a network error.

For globStatus, I'd like to do something similar to {{FileContext#create}}, where you have a varargs argument with {{CreateOpts}}. One nice thing is that {{FileSystem#globStatus}} is not implemented by subclasses like {{listStatus}} is.
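The fallback described in this comment (a filesystem without symlink support inherits a {{listLinkStatus}} that behaves like the resolving call) might look roughly like the sketch below. {{SketchFs}}, {{NoSymlinkFs}} and {{SymlinkFs}} are hypothetical classes invented for illustration, and entries are modeled as plain path strings rather than {{FileStatus}} objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical base class; not Hadoop's FileSystem.
abstract class SketchFs {
    // Returns the entries of dir with every symlink resolved to its target.
    abstract List<String> listStatus(String dir);

    // Returns the entries of dir without resolving symlinks.
    // Default: identical to listStatus, which is correct for any
    // filesystem that has no symlinks at all.
    List<String> listLinkStatus(String dir) {
        return listStatus(dir);
    }
}

// A filesystem with no symlink support only implements listStatus.
final class NoSymlinkFs extends SketchFs {
    @Override
    List<String> listStatus(String dir) {
        return Arrays.asList(dir + "/a", dir + "/b");
    }
}

// A symlink-aware filesystem overrides the link-preserving variant.
final class SymlinkFs extends SketchFs {
    @Override
    List<String> listStatus(String dir) {
        return Arrays.asList(dir + "/a", "/other/target");
    }

    @Override
    List<String> listLinkStatus(String dir) {
        return Arrays.asList(dir + "/a", dir + "/link");
    }
}

final class ListLinkStatusDemo {
    public static void main(String[] args) {
        System.out.println(new NoSymlinkFs().listLinkStatus("/d")); // [/d/a, /d/b]
        System.out.println(new SymlinkFs().listLinkStatus("/d"));   // [/d/a, /d/link]
    }
}
```

The design point is that only filesystems that actually have symlinks need to override anything.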
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773255#comment-13773255 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

Hmm. We could have a convenience method called {{listLinkStatus}} which just called into {{listStatus}} with the correct {{PathOptions}}. I sort of lean towards fewer APIs rather than more, but maybe it makes sense.

Shell globbing doesn't ignore all errors, btw.

{code}
cmccabe@keter:~/mydir> mkdir a
cmccabe@keter:~/mydir> mkdir b
cmccabe@keter:~/mydir> touch a/c
cmccabe@keter:~/mydir> touch b/c
cmccabe@keter:~/mydir> sudo chmod 000 b
root's password:
cmccabe@keter:~/mydir> ls */c
a/c
cmccabe@keter:~/mydir> ls b/c
ls: cannot access b/c: Permission denied
cmccabe@keter:~/mydir> ls
a  b
cmccabe@keter:~/mydir> ls *
a:
c
ls: cannot open directory b: Permission denied
{code}

It's interesting that it ignores the error in the case of {{ls */c}}, but not in the case of {{ls *}}. Hadoop's shell would abort (and give no results) in both of those cases, which I think is suboptimal.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773417#comment-13773417 ]

Jason Lowe commented on HADOOP-9972:
------------------------------------

+1 to the idea of having a new API where symlink resolution and a per-entry error handler can be specified. That should allow the client to cover all three scenarios based on how the handler reacts to errors.

Just to be clear, what happens if the error handler does not rethrow the exception? Is the entry removed from the listStatus results, returned as a raw symlink, or ...? Is it controllable by the error handler? I'm not sure whether the difference between "log the exception and continue" and "ignore it completely" is a different return code from the error handler method or just whether the handler logs or not.

bq. At first glance, I like extending the PathFilters.

That's a twist on the approach; I'm not sure that's been proposed. I suppose one could derive a new interface from PathFilter that becomes PathOptions, and listStatus(Path, PathFilter) could check internally whether it's actually got a PathOption instead of a PathFilter and behave differently. However, I think an explicit, separate API would be preferable, simply for clarity about what the API expects from callers.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773471#comment-13773471 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

bq. Just to be clear, what happens if the error handler does not rethrow the exception?

If the error handler doesn't rethrow the exception, the listStatus / globStatus operation continues normally and returns the remaining results. (We can't return the result that had the error.) Unresolved symlinks are one type of error. Whether to handle {{UnresolvedLinkException}} differently than other exceptions is up to the {{PathErrorHandler}} you provide.

bq. I'm not sure if the difference between log exception and continue vs. ignore it completely is a different return code from the error handler method or just whether the handler logs or not.

I was proposing that the logging happen inside the {{PathErrorHandler}}. From the point of view of FileSystem / FileContext, all we care about is whether the {{PathErrorHandler}} rethrows the exception or not. (We can provide a class implementing PathErrorHandler that logs to FileSystem#LOG if that is a common use case.)

bq. I suppose one could derive a new interface from PathFilter that becomes PathOptions and listStatus(Path, PathFilter) could check internally if it's actually got a PathOption instead of a PathFilter and behave differently. However I think an explicit, separate API would be preferable though, simply for clarity of what the API expects from callers.

Yeah, I was proposing adding a new type, {{PathOptions}}, which could contain an instance of {{PathFilter}}. We could add new methods to {{PathFilter}}, but since it's a public/stable interface rather than an abstract class, that would be an incompatible change.
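The handler semantics described in this comment (rethrow to abort, return normally to drop the failing entry and continue) can be sketched as follows. The {{PathErrorHandler}} interface here is a guess at a shape for the type named in the thread, with a hypothetical {{handleError}} method; the {{resolve}} helper and the use of {{IOException}} in place of {{UnresolvedLinkException}} are assumptions made only to keep the example self-contained.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface named after the PathErrorHandler discussed above.
interface PathErrorHandler {
    // Rethrow to abort the whole listing; return normally to skip this entry.
    void handleError(String path, IOException e) throws IOException;
}

final class ErrorHandlerSketch {
    // Stand-in for resolving one entry; fails for names starting with "dangling".
    static String resolve(String path) throws IOException {
        if (path.startsWith("dangling")) {
            throw new IOException("unresolved link: " + path);
        }
        return path;
    }

    // Lists entries, delegating every per-entry failure to the handler.
    static List<String> listStatus(List<String> entries, PathErrorHandler handler)
            throws IOException {
        List<String> results = new ArrayList<>();
        for (String entry : entries) {
            try {
                results.add(resolve(entry));
            } catch (IOException e) {
                // If the handler returns, the failing entry is simply left out.
                handler.handleError(entry, e);
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        List<String> entries = Arrays.asList("a", "dangling-link", "b");

        PathErrorHandler ignore = (path, e) -> { };
        System.out.println(listStatus(entries, ignore)); // [a, b]

        PathErrorHandler strict = (path, e) -> { throw e; };
        try {
            listStatus(entries, strict);
        } catch (IOException e) {
            System.out.println("aborted: " + e.getMessage()); // aborted: unresolved link: dangling-link
        }
    }
}
```

A logging handler would be a third implementation that writes the exception somewhere and then returns; the listing code itself would not need to know the difference.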
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773584#comment-13773584 ]

Binglin Chang commented on HADOOP-9972:
---------------------------------------

bq. Hmm. We could have a convenience method called listLinkStatus which just called into listStatus with the correct PathOptions. I sort of lean towards fewer APIs rather than more, but maybe it makes sense.

I mean listStatus(Path, PathOption) should call into listLinkStatus (it is HDFS::listStatus, which is a primitive RPC call), not the other way around. I wonder how we can implement listStatus(Path, PathOption) without the primitive of listLinkStatus(Path)?

bq. Shell globbing doesn't ignore all errors

What I mean by globbing is just shell wildcard substitution. It does indeed ignore all errors: the glob merely substitutes matching names for a string containing wildcards.

http://www.linuxjournal.com/content/bash-extended-globbing
http://tldp.org/LDP/abs/html/globbingref.html

{code}
drwxr-xr-x 2 decster staff 68 Sep 19 17:09 aa
drwxr-xr-x 2 decster staff 68 Sep 19 17:12 bb
decster:~/projects/test> echo *
aa bb
decster:~/projects/test> echo */cc
*/cc
{code}

In your example:

{code}
cmccabe@keter:~/mydir> ls b/c
ls: cannot access b/c: Permission denied
# this error is thrown by ls, not by globbing
cmccabe@keter:~/mydir> ls *
a:
c
ls: cannot open directory b: Permission denied
# ls * first becomes ls a b
# then ls throws the error when it processes b
{code}
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772074#comment-13772074 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

I guess I should talk about the motivation here.

Daryn Sharp, Kihwal Lee, Nathan Roberts, Eli Collins, Andrew Wang, and I had a discussion about the new symlinks support in FileSystem in Hadoop 2. The Yahoo! guys were concerned that if listStatus started returning symlinks, a lot of user code would break. One example is code that assumes that if FileStatus#isFile is false, then the inode is a directory. Obviously, that's false in the case of symlinks.

To prevent this scenario, we want to change FileSystem#listStatus and FileSystem#globStatus to resolve all symlinks, and then provide an extended API for users who don't want that auto-resolve behavior. That's what this discussion is about: what that extended API should look like.

The discussion about whether HDFS should replace listStatus with something more like POSIX readdir seems like a tangent. That's an interesting thing to discuss, but it doesn't really solve our problem in branch-2.1-beta, since there is still going to be code around that calls listStatus and globStatus for a long, long time.

This is a tangent, but I'm not even convinced that we should replace {{listStatus}} with {{readdir}}. The reason why {{listStatus}} returns {{FileStatus[]}} rather than just a list of paths and file types is to minimize the number of network round trips to the NameNode. That is still something we care about. If you run {{/bin/ls}} with strace, you'll see that ls calls {{getdents}} (the implementation of readdir) and then makes an {{lstat}} call on each path name in the directory. If the HDFS shell did the same thing, it would have to dramatically increase the number of RPCs it made to the NameNode.

Also see Jason Lowe's comment here: https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002
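The round-trip argument in this comment comes down to simple counting: one batched call versus one call for the names plus one per entry. A toy model, assuming a hypothetical {{CountingNameNode}} stub that does nothing except count the calls it receives (the method names mirror the thread but are not real Hadoop RPCs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical NameNode stub that only counts how many RPCs it receives.
final class CountingNameNode {
    private final List<String> names = Arrays.asList("a", "b", "c");
    int rpcCount = 0;

    // listStatus style: one RPC returns every name together with its status.
    List<String> listStatus() {
        rpcCount++;
        List<String> statuses = new ArrayList<>();
        for (String name : names) {
            statuses.add(name + ":file");
        }
        return statuses;
    }

    // readdir style: one RPC returns the names only.
    List<String> readdir() {
        rpcCount++;
        return names;
    }

    // lstat style: one RPC per name to fetch its status.
    String lstat(String name) {
        rpcCount++;
        return name + ":file";
    }
}

final class RpcCountSketch {
    // Cost of listing a directory with a single batched call.
    static int batchedCost() {
        CountingNameNode nn = new CountingNameNode();
        nn.listStatus();
        return nn.rpcCount;
    }

    // Cost of listing the same directory the way /bin/ls does locally.
    static int readdirThenLstatCost() {
        CountingNameNode nn = new CountingNameNode();
        for (String name : nn.readdir()) {
            nn.lstat(name);
        }
        return nn.rpcCount;
    }

    public static void main(String[] args) {
        System.out.println(batchedCost());          // 1
        System.out.println(readdirThenLstatCost()); // 4 (1 readdir + 3 lstat)
    }
}
```

For a directory with N entries the second pattern costs 1 + N calls, which is cheap against a local kernel but not against a remote NameNode.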
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772566#comment-13772566 ]

Binglin Chang commented on HADOOP-9972:
---------------------------------------

There are two issues we are talking about. One is the new API:

bq. The discussion about whether HDFS should replace listStatus with something more like POSIX readdir seems like a tangent.

I think there is some confusion here; I didn't propose to use POSIX readdir. The API name readdir is probably what caused the confusion, so I changed it to listLinkStatus instead. Its semantics are the same as the current HDFS listStatus, which doesn't resolve links.

bq. To prevent this scenario, we want to change FileSystem#listStatus and FileSystem#globStatus to resolve all symlinks

I am fully aware of this, and my proposal does not break it. Frankly, I don't see any conflict between the two proposals. In order to implement listStatus(Path, PathOption), a listLinkStatus (or something with the same semantics) primitive/core API is required, and it is mostly there already (in HDFS; other filesystems don't support symlinks, except LocalFS). Since there is no conflict from my side, I think you can just submit the patch, or give the implementation detail of listStatus(Path, PathOption) first.

The other issue is that globbing doesn't follow Linux practice. It is probably a tangent; it was brought up only because of the example about the usage of PathErrorHandler. My point is that Linux shell globbing ignores all errors, and the example can be solved by following Linux practice. If we decide not to follow Linux practice and solve it another way, that is OK, although I prefer Linux practice.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772570#comment-13772570 ]

Binglin Chang commented on HADOOP-9972:
---------------------------------------

You were probably confused by my earlier comments. I did not mean that listLinkStatus returns only the filename and type.

bq. Most linux/bsd system, readdir return filename and type.

In that comment I meant the Linux readdir, not the core API listLinkStatus.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770471#comment-13770471 ]

Binglin Chang commented on HADOOP-9972:
---------------------------------------

Hi Colin,

About the globStatus example: if we follow Linux practice,

globStatus(p) = glob(pattern).map(path => getFileStatus(path))

String[] glob(pattern):
  if it matches none, return pattern
  else return matched paths
  ignore all exceptions

I did some experiments. You can see that ls * does indeed show an error message, but ls */stuff should not show an error message.

{code}
[root@master01 test]# mkdir -p aa/cc/foo
[root@master01 test]# mkdir -p bb/cc/foo
[root@master01 test]# chmod 700 bb
[root@master01 test]# ll /home/serengeti/.bash
[root@master01 test]# su serengeti
[serengeti@master01 test]$ ll
total 8
drwxr-xr-x 3 root root 4096 Sep 18 08:30 aa
drwx------ 3 root root 4096 Sep 18 08:31 bb
[serengeti@master01 test]$ ls *
aa:
cc
ls: bb: Permission denied
[serengeti@master01 test]$ ls */cc
foo
{code}

Separating globStatus into glob and getFileStatus seems a more proper way of doing globStatus than adding new classes/interfaces and a callback handler. It is also Linux practice, so it should be more robust.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770496#comment-13770496 ] Binglin Chang commented on HADOOP-9972: --- Regarding API, I think we should differentiate core API and extend/legacy API, IMO, there should be 3 core API: getFileStatus resolve symlink getFileLinkStatus don't resolve symlink readdir don't resolve symlink, just like current HDFS listStatus These core API should be implemented in each FS All other related APIs can be build based on core API and implemented in FSContext/FileSystem once for all: {code} FS.listStatus(path): readdir(path).map(s = if (s.isSymlink) getFileStatus ignore Exception else s) FS.listStatus(path, PathOptions): readdir(path).map(process PathOptions) glob(pattern): if pattern matches none, return pattern else return matched paths ignore all exceptions globStatus(pattern): glob(pattern).map(getFileStatus) {code} new APIs for listStatus and globStatus to deal with symlinks Key: HADOOP-9972 URL: https://issues.apache.org/jira/browse/HADOOP-9972 Project: Hadoop Common Issue Type: Improvement Components: fs Affects Versions: 2.1.1-beta Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to deal with symlinks. The issue is that code has been written which is incompatible with the existence of things which are not files or directories. For example, there is a lot of code out there that looks at FileStatus#isFile, and if it returns false, assumes that what it is looking at is a directory. In the case of a symlink, this assumption is incorrect. It seems reasonable to make the default behavior of {{FileSystem#listStatus}} and {{FileSystem#globStatus}} be fully resolving symlinks, and ignoring dangling ones. This will prevent incompatibility with existing MR jobs and other HDFS users. 
We should also add new versions of listStatus and globStatus that allow new, symlink-aware code to deal with symlinks as symlinks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
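Editorial sketch: the layering Binglin Chang proposes (a few per-filesystem primitives, with the legacy listStatus written once on top of them) might look roughly like the following. Everything here is hypothetical and only illustrates the idea: Entry stands in for FileStatus, CoreFs and ToyFs are invented for the example, paths are plain strings, and "ignore Exception" is read as "skip links that cannot be resolved".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for Hadoop's FileStatus; only what the sketch needs.
final class Entry {
    final String path;
    final String linkTarget; // null unless this entry is a symlink
    Entry(String path, String linkTarget) { this.path = path; this.linkTarget = linkTarget; }
    boolean isSymlink() { return linkTarget != null; }
}

// The three proposed core primitives that each filesystem would implement.
interface CoreFs {
    Entry getFileLinkStatus(String path) throws IOException;   // does not resolve symlinks
    Entry getFileStatus(String path) throws IOException;       // resolves symlinks
    List<Entry> listLinkStatus(String dir) throws IOException; // the "readdir" primitive
}

// Toy in-memory filesystem, used only to exercise the sketch.
final class ToyFs implements CoreFs {
    private final Map<String, Entry> entries = new TreeMap<>();
    void add(String path, String linkTarget) { entries.put(path, new Entry(path, linkTarget)); }

    public Entry getFileLinkStatus(String path) throws IOException {
        Entry e = entries.get(path);
        if (e == null) throw new IOException("no such path: " + path);
        return e;
    }

    public Entry getFileStatus(String path) throws IOException {
        Entry e = getFileLinkStatus(path);
        // Follow links, bounding the number of hops so that cycles terminate.
        for (int hops = 0; e.isSymlink(); hops++) {
            if (hops >= 32) throw new IOException("too many levels of links: " + path);
            e = getFileLinkStatus(e.linkTarget);
        }
        return e;
    }

    public List<Entry> listLinkStatus(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<Entry> out = new ArrayList<>();
        for (Entry e : entries.values()) {
            // Keep direct children only.
            if (e.path.startsWith(prefix) && e.path.indexOf('/', prefix.length()) < 0) out.add(e);
        }
        return out;
    }
}

final class DerivedListing {
    // Legacy-style listStatus, written once on top of the primitives.
    static List<Entry> listStatus(CoreFs fs, String dir) throws IOException {
        List<Entry> out = new ArrayList<>();
        for (Entry e : fs.listLinkStatus(dir)) {
            if (!e.isSymlink()) { out.add(e); continue; }
            try {
                out.add(fs.getFileStatus(e.path));
            } catch (IOException unresolved) {
                // Assumption: a link that cannot be resolved is silently skipped.
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        ToyFs fs = new ToyFs();
        fs.add("/d/file", null);
        fs.add("/d/good", "/d/file");
        fs.add("/d/bad", "/missing");
        for (Entry e : listStatus(fs, "/d")) System.out.println(e.path);
    }
}
```

In this toy, the dangling link /d/bad disappears from the derived listing, which is exactly the silent-skip behavior that the later comments in the thread argue about.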
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770532#comment-13770532 ]

Binglin Chang commented on HADOOP-9972:
---

Perhaps listLinkStatus is a better name for readdir.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771492#comment-13771492 ]

Colin Patrick McCabe commented on HADOOP-9972:
--

bq. I did some experiments, you can see ls * indeed shows an error message, but ls */stuff should not show an error message.

I'm afraid that what you're seeing is a bug. I introduced this bug and I have a patch available to fix it: https://issues.apache.org/jira/browse/HADOOP-9929

This bug is not in branch-2.1-beta, so if you'd like to see the current, correct behavior of globStatus, try that branch. You can also try branch-1.

bq. [listLinkStatus proposal]

I want to avoid a combinatorial explosion of function overloads. Right now we have {{FileSystem#listStatus(Path)}}, {{FileSystem#listStatus(Path, PathFilter)}}, {{FileSystem#listStatus(Path[])}}, and {{FileSystem#listStatus(Path[], PathFilter filter)}}. If we create {{listLinkStatus}} as you proposed, that doubles the number of these functions in FileSystem, since we have to create a {{listLinkStatus}} equivalent for each of them. I think it's much cleaner to fold the {{PathFilter}} into a {{PathOptions}} class. That only requires adding two new functions to FileSystem: {{FileSystem#listStatus(Path, PathOptions)}} and {{FileSystem#listStatus(Path[], PathOptions)}}.

With regard to {{globStatus}}, you can't build what we want on top of what we have now. The first IOException we hit will cause the globStatus function to abort. Clients like the shell, which want to handle errors differently, simply don't get a chance to do so with the current API.

bq. Separate globStatus to glob and getFileStatus seems a more proper way of doing globStatus rather than add new classes/interface and callback handler, and this is linux practice, should be more robust

The Linux practice is based on the fact that {{readdir}} only returns path names (i.e. strings) in POSIX. In HDFS and other Hadoop filesystems, we don't have {{readdir}}, only {{listStatus}}, {{getFileStatus}}, and {{getFileLinkStatus}}, which return {{FileStatus}} objects. Since we're already dealing with {{FileStatus}} objects, it makes no sense to call {{getFileStatus}} on them again: it's a pure waste of computer time. You also need some way of handling errors encountered in globStatus besides ignoring them or aborting the whole glob. See HADOOP-9929 for more commentary on this issue.
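Editorial sketch: the "fold the filter into an options object" argument can be illustrated with a toy class in which the four existing overload shapes become thin wrappers around two options-based methods. This is not Hadoop code. SketchFs, ListOptions, and ToyListing are invented names, paths are plain strings, and java.util.function.Predicate stands in for PathFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, trimmed-down options object: it carries only a filter here.
final class ListOptions {
    static final ListOptions DEFAULT = new ListOptions(p -> true);
    final Predicate<String> filter;
    ListOptions(Predicate<String> filter) { this.filter = filter; }
}

abstract class SketchFs {
    // The one method a concrete filesystem has to implement in this sketch.
    abstract List<String> listStatus(String dir, ListOptions options);

    // The second options-based entry point: several directories at once.
    List<String> listStatus(String[] dirs, ListOptions options) {
        List<String> out = new ArrayList<>();
        for (String d : dirs) out.addAll(listStatus(d, options));
        return out;
    }

    // The four existing overload shapes reduce to one-line wrappers.
    List<String> listStatus(String dir) { return listStatus(dir, ListOptions.DEFAULT); }
    List<String> listStatus(String dir, Predicate<String> f) { return listStatus(dir, new ListOptions(f)); }
    List<String> listStatus(String[] dirs) { return listStatus(dirs, ListOptions.DEFAULT); }
    List<String> listStatus(String[] dirs, Predicate<String> f) { return listStatus(dirs, new ListOptions(f)); }
}

// Toy implementation: every directory contains the same two names.
final class ToyListing extends SketchFs {
    @Override
    List<String> listStatus(String dir, ListOptions options) {
        List<String> out = new ArrayList<>();
        for (String name : Arrays.asList("a.txt", "b.log")) {
            String p = dir + "/" + name;
            if (options.filter.test(p)) out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        SketchFs fs = new ToyListing();
        System.out.println(fs.listStatus("/x"));
        System.out.println(fs.listStatus(new String[] {"/x", "/y"}, p -> p.endsWith(".log")));
    }
}
```

Adding a further option later (for example, an error handler) would then mean adding a field to the options class rather than another family of overloads.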
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771563#comment-13771563 ]

Binglin Chang commented on HADOOP-9972:
---

bq. I'm afraid that what you're seeing is a bug. I introduced this bug and I have a patch available to fix it: https://issues.apache.org/jira/browse/HADOOP-9929

The experiment was done on Linux; I was talking about Linux practice. The practice is that glob ignores all permission/dangling errors, and then ls handles the errors properly. We had better follow Linux practice; I don't see how it is related to HADOOP-9929. I think the correct fix for HADOOP-9929 should be:

{code}
hadoop fs -ls /user/abc/tests/data
glob(/user/abc/tests/data)
  pattern matches nothing because of a permission issue, so just return [/user/abc/tests/data]
ls [/user/abc/tests/data]
  returns a permission error
{code}

bq. I want to avoid a combinatorial explosion of function overloads.

There is no combinatorial explosion. Every FS already has a listStatus implementation. If the FS supports symlinks (to my knowledge only LocalFS and HDFS do), we add listLinkStatus (for HDFS, just rename listStatus to listLinkStatus). If the FS does not support symlinks, by default listStatus = listLinkStatus. The change is minimal. All the other, non-core APIs (listStatus(Path), listStatus(Path, PathFilter), listStatus(Path[]), listStatus(Path[], PathFilter), listStatus(Path, PathOption)) should only be implemented in FS/FC.

listStatus(Path, PathOption) doesn't look like a core API; a core API should be minimal, orthogonal, and complete. In the end, listStatus(Path, PathOption) still needs a readdir/getLinkStatus equivalent to implement it.

bq. The Linux practice is based on the fact that readdir only returns path names (i.e. strings) in POSIX

On most Linux/BSD systems, readdir returns the filename and the type. http://man7.org/linux/man-pages/man3/readdir.3.html
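Editorial sketch: the shell convention Binglin Chang describes (an unmatched pattern is handed back as-is, and the command that consumes it reports the error) can be modelled in a few lines. This is a toy over in-memory path lists, not Hadoop or libc code. ShellStyleGlob, its methods, and the error text are invented for illustration, and only `*` is supported as a wildcard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ShellStyleGlob {
    // Translate a glob that uses only '*' into a regex; '*' never crosses '/'.
    static Pattern toRegex(String glob) {
        String[] parts = glob.split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append("[^/]*");
            sb.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(sb.toString());
    }

    // visible: the paths the caller was able to enumerate. Errors hit while
    // enumerating (permissions, dangling links) are assumed to be dropped already.
    static List<String> glob(List<String> visible, String pattern) {
        Pattern re = toRegex(pattern);
        List<String> hits = new ArrayList<>();
        for (String p : visible) {
            if (re.matcher(p).matches()) hits.add(p);
        }
        if (hits.isEmpty()) hits.add(pattern); // no match: hand back the pattern itself
        return hits;
    }

    // The command consuming the glob result is the one that reports errors.
    static List<String> ls(List<String> accessible, List<String> args) {
        List<String> lines = new ArrayList<>();
        for (String a : args) {
            lines.add(accessible.contains(a) ? a : "ls: cannot access " + a);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> visible = Arrays.asList("/b/stuff");
        System.out.println(ls(visible, glob(visible, "/*/stuff")));
        System.out.println(ls(visible, glob(visible, "/user/abc/tests/data")));
    }
}
```

Under this model the glob step never fails. The cost, which Colin Patrick McCabe raises elsewhere in the thread, is that errors encountered while expanding the pattern are invisible to the caller.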
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771565#comment-13771565 ]

Binglin Chang commented on HADOOP-9972:
---

bq. listStatus(Path, PathOption) doesn't look like a core API; a core API should be minimal, orthogonal, and complete. In the end, listStatus(Path, PathOption) still needs a readdir/getLinkStatus equivalent to implement it.

Sorry, getLinkStatus should be listLinkStatus.

Another way of saying this is that readdir/listLinkStatus is already there (LocalFS needs some changes, HDFS already has listStatus, and for an FS that doesn't support symlinks listStatus == listLinkStatus), but for compatibility reasons we can't use the name listStatus anymore, so we just change it to use another name.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771569#comment-13771569 ]

Binglin Chang commented on HADOOP-9972:
---

bq. Since we're already dealing with FileStatus objects, it makes no sense to call getFileStatus on them again-- it's a pure waste of computer time.

It is unfortunate that we don't have a readdir that only gets the inode name and type, but that is the way shell globbing works. Correctness comes before efficiency; we can combine the 2 steps as an optimization, as long as the result is still correct.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769646#comment-13769646 ]

Colin Patrick McCabe commented on HADOOP-9972:
--

I think we can probably let {{FileContext#listStatus}} and {{FileContext#Util#globStatus}} default to *not* fully resolving symlinks. This makes sense, since {{FileContext}} has had symlink support for a long time, and doesn't have as much legacy code relying on it.

We also probably need some way of sensibly handling errors in globStatus. Right now, we really only have the choice of ignoring the error or throwing an exception which ends the whole globStatus. We should add some options.
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769653#comment-13769653 ]

Colin Patrick McCabe commented on HADOOP-9972:
--

Proposed new APIs (in FileSystem and FileContext):

{code}
FileStatus[] listStatus(Path path, PathOptions options) throws IOException;
FileStatus[] globStatus(Path path, PathOptions options) throws IOException;
{code}

The {{PathOptions}} class will contain three fields:

{code}
private PathFilter pathFilter;
private PathErrorHandler errorHandler;
private Boolean resolveSymlinks;
{code}

{{PathFilter}} serves the same purpose that it currently does: filtering out paths from the results.

{{PathErrorHandler}} has a {{handleError}} function taking a {{Path}} and an {{IOException}}. This function gets invoked whenever there is an IOException. It can choose to rethrow the exception, log the exception and continue, or simply ignore it completely.

{{resolveSymlinks}} determines whether we should fully resolve all symlinks that we come across. If it is set, we will never get back a FileStatus for a symlink from either {{listStatus}} or {{globStatus}}.

We can add more fields to {{PathOptions}} later if it becomes necessary.
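Editorial sketch: one possible shape for the {{PathOptions}} holder described above, written as a small fluent class. This is only an illustration under stated assumptions. The PathFilter and PathErrorHandler interfaces below are local stand-ins that take a String instead of a Hadoop Path, the setter and getter names are invented, the defaults (accept everything, rethrow errors, resolve symlinks) are guesses, and a primitive boolean is used where the proposal lists Boolean.

```java
import java.io.IOException;

// Local stand-ins for the types named in the proposal; String replaces Path.
interface PathFilter {
    boolean accept(String path);
}

interface PathErrorHandler {
    // May rethrow e, record it, or return normally to swallow it.
    void handleError(String path, IOException e) throws IOException;
}

final class PathOptions {
    // Assumed defaults: accept everything, rethrow errors, resolve symlinks.
    private PathFilter pathFilter = path -> true;
    private PathErrorHandler errorHandler = (path, e) -> { throw e; };
    private boolean resolveSymlinks = true;

    PathOptions setPathFilter(PathFilter f) { pathFilter = f; return this; }
    PathOptions setErrorHandler(PathErrorHandler h) { errorHandler = h; return this; }
    PathOptions setResolveSymlinks(boolean r) { resolveSymlinks = r; return this; }

    PathFilter getPathFilter() { return pathFilter; }
    PathErrorHandler getErrorHandler() { return errorHandler; }
    boolean getResolveSymlinks() { return resolveSymlinks; }
}

final class PathOptionsDemo {
    public static void main(String[] args) {
        // A symlink-aware caller: hide temp files, report errors, keep links as links.
        PathOptions opts = new PathOptions()
            .setPathFilter(p -> !p.endsWith(".tmp"))
            .setErrorHandler((p, e) -> System.err.println(p + ": " + e.getMessage()))
            .setResolveSymlinks(false);
        System.out.println(opts.getPathFilter().accept("/data/part-0.tmp"));
        System.out.println(opts.getResolveSymlinks());
    }
}
```

Because every field has a default, a caller who only cares about one setting can set just that one, and new fields can be added later without touching existing call sites.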
[jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
[ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769661#comment-13769661 ]

Colin Patrick McCabe commented on HADOOP-9972:
--

I guess I should add a few words about why {{PathErrorHandler}} is necessary. Basically, we want to give users of {{globStatus}} flexibility. For example, let's say you have the following directories:

/a owned by superuser, mode
/b owned by bob, mode 0777

Bob would like to be able to get back a result from {{globStatus(/\*/stuff)}}, not just an AccessControlException (which came out of trying to access /a/stuff). But bob also doesn't necessarily want to ignore the AccessControlException completely. He wants something like the behavior of GNU ls, which prints an error message to stderr about the paths it can't access, but still continues to list the remaining paths which it can.

Currently, bob can't get this: he simply gets an IOException and *no* globStatus results. Ignoring the error completely also seems like the wrong thing to do, though. Hence the {{PathErrorHandler}}, which allows more sophisticated error handling here.

Symlinks make this more important, since you have errors like {{UnresolvedPathException}}, which anyone can cause simply by creating a dangling symlink. We don't want directories with dangling symlinks to become un-globbable.

Obviously, the default error handlers will provide the existing behavior for {{listStatus}} and {{globStatus}}.
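Editorial sketch: the three handler behaviors mentioned in the proposal (rethrow, log and continue, ignore) can be tried out on a toy version of the /a and /b example. Nothing here is Hadoop code. PathErrorHandler is a local stand-in using String paths, GlobSketch and globStuff are invented names, and the unreadable directory is simulated with a boolean flag rather than real permissions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for the proposed handler; String replaces Path.
interface PathErrorHandler {
    void handleError(String path, IOException e) throws IOException;
}

final class GlobSketch {
    // Abort the whole glob on the first error.
    static final PathErrorHandler RETHROW = (path, e) -> { throw e; };
    // Drop errors silently.
    static final PathErrorHandler IGNORE = (path, e) -> { };
    // Record the error and keep going, similar in spirit to ls writing to stderr.
    static PathErrorHandler logTo(List<String> log) {
        return (path, e) -> log.add(path + ": " + e.getMessage());
    }

    // Toy stand-in for globStatus("/*/stuff"): dirs maps a directory to
    // whether the caller is allowed to look inside it.
    static List<String> globStuff(Map<String, Boolean> dirs, PathErrorHandler handler)
            throws IOException {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Boolean> d : dirs.entrySet()) {
            String candidate = d.getKey() + "/stuff";
            if (d.getValue()) {
                out.add(candidate);
            } else {
                // The handler decides whether this ends the glob.
                handler.handleError(candidate, new IOException("Permission denied"));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Boolean> dirs = new LinkedHashMap<>();
        dirs.put("/a", false); // not readable by bob
        dirs.put("/b", true);
        List<String> log = new ArrayList<>();
        System.out.println(globStuff(dirs, logTo(log)));
        System.out.println(log);
    }
}
```

With the logging handler bob gets the /b result plus a record of the failure under /a. With RETHROW he gets the all-or-nothing behavior the comment describes as the current state.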