Proposed function path_in_directory() [was: Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths]
On 10/01/2012 06:51 AM, Michael Haggerty wrote: > I think I would advocate that the prefix has to match the front of the > path exactly (including any trailing slashes) and either > > strlen(prefix) == 0 > or the prefix ended with a '/' > or the prefix and path are identical > or the character in path following the matching part is a '/' > > This would allow the "is path its own prefix" policy to be decided by > the caller by either including or omitting a trailing slash on the > prefix argument. Thinking about this more, I don't think it will work. As usual, the special cases around "/" and "//" make things awkward. I think it is necessary to have a separate argument to specify whether "path is its own prefix". So I am trying to decide how a function path_in_directory() should work, and would like to get some feedback, especially on the following two points: 1. How should "//" be handled? I don't really have experience with the peculiar paths that start with "//", so I'm not sure how they should be handled (or even if the handling needs to be platform-dependent). My working hypothesis is that the inputs should be normalized by the caller, so if the caller wants "//" to be treated as equivalent to "/" then the caller should normalize them *before* calling this function. Conversely, if the caller passes "//" to the function, that implies that "//" is distinct from "/" and "//" is considered a proper subdirectory of "/". See the cases marked with "??" below. 2. Does there need to be any special related to DOS paths? > /* > * Return true iff path is within dir. The comparison is textual, > * meaning that path and dir should be normalized and either both be > * absolute or both be relative to the same directory. If path and > * dir represent the *same* path, then return true iff allow_equal is > * true. Single trailing slashes on either path or dir are ignored, > * (except for the special case "//"); i.e., "a/b" and "a/b/" are > * treated equivalently, as are "" and "/". Examples (* means "don't > * care"): > * > * - path_in_directory("a/b", "a", *) -> true > * - path_in_directory("a", "a/b", *) -> false > * - path_in_directory("ab", "a", *) -> false > * - path_in_directory("a/b", "a/b", 0) -> false > * (same if either argument is replaced with "a/b/") > * - path_in_directory("a/b", "a/b", 1) -> true > * (same if either argument is replaced with "a/b/") > * - path_in_directory(*, "/", 1) -> true > * - path_in_directory("/", "/", 0) -> false > * - path_in_directory("//", "/", 0) -> true?? > * - path_in_directory("//", "/", 1) -> true > * - path_in_directory("/", "//", 0) -> false > * - path_in_directory("/", "//", 1) -> false ?? > * - path_in_directory("/a/b", "//", *) -> false > */ > int path_in_directory(const char *path, const char *dir, int allow_equal); Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths
Michael Haggerty writes: > I think I would advocate that the prefix has to match the front of > the path exactly (including any trailing slashes) and either > > strlen(prefix) == 0 > or the prefix ended with a '/' > or the prefix and path are identical > or the character in path following the matching part is a '/' > > This would allow the "is path its own prefix" policy to be decided by > the caller by either including or omitting a trailing slash on the > prefix argument. I think that is sensible thing to do. The primary thing I found questionable was that the function, given "/net/wink/project/frotz" to check against "/pub:/s" (or "/pub/:/s/" if you like), will report that "/net/wink/project" directory is the longest ancestor, when "/s" is a symlink that happens to point at "/net/wink/project". It is very counter-intuitive when you view its two input strings as strings. By making its sole caller expand the symbolic links, it would be a lot clearer what is going on to anybody who follow the codepath. We have one path obtained from getcwd() and a set of paths all of which are real paths without symbolic aliasing, and we check if one among the latter cover an earlier part of the former. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths
On 09/30/2012 10:00 AM, Junio C Hamano wrote: > Michael Haggerty writes: > >> longest_ancestor_length() relies on a textual comparison of directory >> parts to find the part of path that overlaps with one of the paths in >> prefix_list. But this doesn't work if any of the prefixes involves a >> symbolic link, because the directories will look different even though >> they might logically refer to the same directory. So canonicalize the >> paths listed in prefix_list using real_path_if_valid() before trying >> to find matches. > > I somehow feel that this is making the logic unnecessarily > convoluted by solving the problem at a wrong level. > > If longest_ancestor_length() takes a single string and a list of > candidate string prefixes, conceptually it should be usable for any > hierarchy-looking string that uses slashes as hierarchy separator > (e.g. refs that may be stored in packed-refs that you cannot expect > lstat(2) or readlink(2) to give you any sensible results). > > The real problem is that the list given from the environment may > contain a path that violates that "it suffices to take the longest > string-prefix taking slashes into account" assumption in such a > generic l-a-l implementation, no? And this patch solves it by > making l-a-l _less_ generic and forcing it to be aware of the glitch > of its caller (you can either blame clueless user who lies when > setting the GIT_CEILING_DIRECTORIES by including paths contaminated > with symlinks, or blame the calling setup_git_directory_gently_1() > function that does not resolve the symbolic links before calling > this function). > > As l-a-l is only used by the "stop at the ceiling" logic, isn't it a > far simpler solution to keep the function work at the string level, > and make the caller fix its input, i.e. the value taken from the > environment variable, before calling it? That is, grab the value of > GIT_CEILING_DIRECTORIES, split it into a list at PATH_SEP (while > rejecting any non-absolute paths), normalize the elements of that > list by resolving symbolic links, and then pass the cwd[] and the > normalized string list to l-a-l? > > The resulting callsite in setup_git_directory_gently_1() would pass > cwd[] and the ceiling_dirs (which now is a string list), all of > whose elements would happen to begin with "/" (or dos drive prefix > followed by the "root" symbol), but l-a-l can be written in such a > way that it does not even require that all the input has to begin at > root, which would later make it usable with things that are not > paths (references, for example, all of which begin with "refs/" and > not "/refs/"). I agree that longest_ancestor_length() is not very generic or interesting anymore. Nearly all of its "added value" comes from the normalize_path_callback() helper function. One possibility would be to inline it at the one place it is called. The function string_list_longest_prefix() was my attempt to isolate the kernel of generic functionality from longest_ancestor_path(). It is *almost* the function that you are proposing, with the exception that it does not ensure that the prefix match ends at a "/" boundary. So another alternative could be to change this function to respect "/" boundaries. (After the change, the function might not belong in the string-list API anymore.) However, the semantics of a function that matches prefixes at "/" boundaries is not entirely obvious. Suppose we would implement the test via a function like path_prefixcmp(path, prefix). I can think of a few policy questions that would have to be answered: * Is a trailing slash on the prefix argument required, optional, or prohibited? What if the prefix is "/" or "//" or "c:/"? * Is a trailing slash on the path argument optional/prohibited? Are "/" or "//" allowed as path arguments? * Is a path its own prefix? path_prefixcmp("a/b", "a/b") -> true or false? (For the implementation of longest_ancestor_path(), we would prefer this to return "false".) Or does the answer depend on whether the prefix has a trailing slash? path_prefixcmp("a/b", "a/b") -> true? path_prefixcmp("a/b", "a/b/") -> false? Part of the reason that I implemented string_list_longest_prefix() instead of the function that you suggest is that the behavior of the former is far more obvious. I think I would advocate that the prefix has to match the front of the path exactly (including any trailing slashes) and either strlen(prefix) == 0 or the prefix ended with a '/' or the prefix and path are identical or the character in path following the matching part is a '/' This would allow the "is path its own prefix" policy to be decided by the caller by either including or omitting a trailing slash on the prefix argument. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordom
Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths
Michael Haggerty writes: > longest_ancestor_length() relies on a textual comparison of directory > parts to find the part of path that overlaps with one of the paths in > prefix_list. But this doesn't work if any of the prefixes involves a > symbolic link, because the directories will look different even though > they might logically refer to the same directory. So canonicalize the > paths listed in prefix_list using real_path_if_valid() before trying > to find matches. I somehow feel that this is making the logic unnecessarily convoluted by solving the problem at a wrong level. If longest_ancestor_length() takes a single string and a list of candidate string prefixes, conceptually it should be usable for any hierarchy-looking string that uses slashes as hierarchy separator (e.g. refs that may be stored in packed-refs that you cannot expect lstat(2) or readlink(2) to give you any sensible results). The real problem is that the list given from the environment may contain a path that violates that "it suffices to take the longest string-prefix taking slashes into account" assumption in such a generic l-a-l implementation, no? And this patch solves it by making l-a-l _less_ generic and forcing it to be aware of the glitch of its caller (you can either blame clueless user who lies when setting the GIT_CEILING_DIRECTORIES by including paths contaminated with symlinks, or blame the calling setup_git_directory_gently_1() function that does not resolve the symbolic links before calling this function). As l-a-l is only used by the "stop at the ceiling" logic, isn't it a far simpler solution to keep the function work at the string level, and make the caller fix its input, i.e. the value taken from the environment variable, before calling it? That is, grab the value of GIT_CEILING_DIRECTORIES, split it into a list at PATH_SEP (while rejecting any non-absolute paths), normalize the elements of that list by resolving symbolic links, and then pass the cwd[] and the normalized string list to l-a-l? The resulting callsite in setup_git_directory_gently_1() would pass cwd[] and the ceiling_dirs (which now is a string list), all of whose elements would happen to begin with "/" (or dos drive prefix followed by the "root" symbol), but l-a-l can be written in such a way that it does not even require that all the input has to begin at root, which would later make it usable with things that are not paths (references, for example, all of which begin with "refs/" and not "/refs/"). Hrm? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths
longest_ancestor_length() relies on a textual comparison of directory parts to find the part of path that overlaps with one of the paths in prefix_list. But this doesn't work if any of the prefixes involves a symbolic link, because the directories will look different even though they might logically refer to the same directory. So canonicalize the paths listed in prefix_list using real_path_if_valid() before trying to find matches. path is already in canonical form, so doesn't need to be canonicalized again. This fixes some problems with using GIT_CEILING_DIRECTORIES that contains paths involving symlinks, including t4035 if run with --root set to a path involving symlinks. Remove a number of tests of longest_ancestor_length(). It is awkward to test longest_ancestor_length() now, because its new path normalization behavior depends on the contents of the whole filesystem. On the other hand: * longest_ancestor_length() is now built of reusable components that are themselves tested separately (string_list_split(), string_list_longest_prefix(), and real_path_if_valid()), so it contains less code that can go wrong. * longest_ancestor_length() gets some testing (albeit not systematic) via the GIT_CEILING_DIRECTORIES tests. Therefore the work of updating these tests exceeds any expected benefits. Signed-off-by: Michael Haggerty --- path.c| 18 +-- t/t0060-path-utils.sh | 64 --- 2 files changed, 11 insertions(+), 71 deletions(-) diff --git a/path.c b/path.c index b20f2fb..40d7360 100644 --- a/path.c +++ b/path.c @@ -570,21 +570,25 @@ int normalize_path_copy(char *dst, const char *src) static int normalize_path_callback(struct string_list_item *item, void *cb_data) { - char buf[PATH_MAX+2]; + char *buf; const char *ceil = item->string; - int len = strlen(ceil); + const char *realpath; + int len; - if (len == 0 || len > PATH_MAX || !is_absolute_path(ceil)) + if (!*ceil || !is_absolute_path(ceil)) return 0; - if (normalize_path_copy(buf, ceil) < 0) + realpath = real_path_if_valid(ceil); + if (!realpath) return 0; - len = strlen(buf); + len = strlen(realpath); + buf = xmalloc(len + 2); /* Leave space for possible trailing slash */ + strcpy(buf, realpath); if (len == 0 || buf[len-1] != '/') { buf[len++] = '/'; - buf[len++] = '\0'; + buf[len] = '\0'; } free(item->string); - item->string = xstrdup(buf); + item->string = buf; return 1; } diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh index 4ef2345..c97bbf2 100755 --- a/t/t0060-path-utils.sh +++ b/t/t0060-path-utils.sh @@ -12,28 +12,6 @@ norm_path() { "test \"\$(test-path-utils normalize_path_copy '$1')\" = '$2'" } -# On Windows, we are using MSYS's bash, which mangles the paths. -# Absolute paths are anchored at the MSYS installation directory, -# which means that the path / accounts for this many characters: -rootoff=$(test-path-utils normalize_path_copy / | wc -c) -# Account for the trailing LF: -if test $rootoff = 2; then - rootoff=# we are on Unix -else - rootoff=$(($rootoff-1)) -fi - -ancestor() { - # We do some math with the expected ancestor length. - expected=$3 - if test -n "$rootoff" && test "x$expected" != x-1; then - expected=$(($expected+$rootoff)) - fi - test_expect_success "longest ancestor: $1 $2 => $expected" \ - "actual=\$(test-path-utils longest_ancestor_length '$1' '$2') && -test \"\$actual\" = '$expected'" -} - # Absolute path tests must be skipped on Windows because due to path mangling # the test program never sees a POSIX-style absolute path case $(uname -s) in @@ -93,48 +71,6 @@ norm_path /d1/s1//../s2/../../d2 /d2 POSIX norm_path /d1/.../d2 /d1/.../d2 POSIX norm_path /d1/..././../d2 /d1/d2 POSIX -ancestor / "" -1 -ancestor / / -1 -ancestor /foo "" -1 -ancestor /foo : -1 -ancestor /foo ::. -1 -ancestor /foo ::..:: -1 -ancestor /foo / 0 -ancestor /foo /fo -1 -ancestor /foo /foo -1 -ancestor /foo /foo/ -1 -ancestor /foo /bar -1 -ancestor /foo /bar/ -1 -ancestor /foo /foo/bar -1 -ancestor /foo /foo:/bar/ -1 -ancestor /foo /foo/:/bar/ -1 -ancestor /foo /foo::/bar/ -1 -ancestor /foo /:/foo:/bar/ 0 -ancestor /foo /foo:/:/bar/ 0 -ancestor /foo /:/bar/:/foo 0 -ancestor /foo/bar "" -1 -ancestor /foo/bar / 0 -ancestor /foo/bar /fo -1 -ancestor /foo/bar foo -1 -ancestor /foo/bar /foo 4 -ancestor /foo/bar /foo/ 4 -ancestor /foo/bar /foo/ba -1 -ancestor /foo/bar /:/fo 0 -ancestor /foo/bar /foo:/foo/ba 4 -ancestor /foo/bar /bar -1 -ancestor /foo/bar /bar/ -1 -ancestor /foo/bar /fo: -1 -ancestor /foo/bar :/fo -1 -ancestor /foo/bar /foo:/bar/ 4 -ancestor /foo/bar /:/foo:/bar/ 4 -ancestor /foo/bar /foo:/:/bar/ 4 -ancestor /foo/bar /:/bar/:/fo