Re: [BUG] gitignore documentation inconsistent with actual behaviour
On 20 Oct 2018, at 01:03, Duy Nguyen wrote: >foo**bar would match foobar as well as foo/bar, foo/x/bar and >foo/x/y/bar... Its behavior is error prone in my opinion. There's also >some concerns in early iterations of this "**" support that we would >need to revisit if we want 'rsync' behavior. I'm not very excited >about doing that. That's fair. I guess another point in favour of your second option is that it's essentially the same behaviour used by bash (with the `globstar` option) and zsh (with the default options); they also give `**` special recursion powers only when used in a path component by itself, otherwise it acts like `*`. So there's precedent there. dana
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On Sat, Oct 20, 2018 at 7:53 AM dana wrote: > > On 20 Oct 2018, at 00:26, Duy Nguyen wrote: > >Which way should we go? I'm leaning towards the second one... > > Not sure how much my opinion is worth, but the second option does feel more > friendly (from a usage perspective) as well as more straight-forward (from a > re-implementation perspective). Yeah. And not having to describe all the corner cases is a plus. Too many corner cases are a sign of bad implementation anyway. I'll wait some more time for the others to speak up before I cook a proper patch. > There's a third option too, though, right? The 'rsync' behaviour mentioned > earlier? It wouldn't matter either way in any of the examples i listed, but is > there ever a conceivable use case for something like `foo**bar`, where the > `**` > matches across slashes? (I can't think of any reason i'd need that personally, > but then again i don't understand why these people are using `**` the way they > are in the first place.) foo**bar would match foobar as well as foo/bar, foo/x/bar and foo/x/y/bar... Its behavior is error prone in my opinion. There's also some concerns in early iterations of this "**" support that we would need to revisit if we want 'rsync' behavior. I'm not very excited about doing that. -- Duy
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On 20 Oct 2018, at 00:26, Duy Nguyen wrote: >Which way should we go? I'm leaning towards the second one... Not sure how much my opinion is worth, but the second option does feel more friendly (from a usage perspective) as well as more straight-forward (from a re-implementation perspective). There's a third option too, though, right? The 'rsync' behaviour mentioned earlier? It wouldn't matter either way in any of the examples i listed, but is there ever a conceivable use case for something like `foo**bar`, where the `**` matches across slashes? (I can't think of any reason i'd need that personally, but then again i don't understand why these people are using `**` the way they are in the first place.) dana
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On Thu, Oct 11, 2018 at 05:19:06AM -0500, dana wrote: > Hello, > > I'm a contributor to ripgrep, which is a grep-like tool that supports using > gitignore files to control which files are searched in a repo (or any other > directory tree). ripgrep's support for the patterns in these files is based on > git's official documentation, as seen here: > > https://git-scm.com/docs/gitignore > > One of the most common reports on the ripgrep bug tracker is that it does not > allow patterns like the following real-world examples, where a ** is used > along > with other text within the same path component: > > **/**$$*.java > **.orig > **local.properties > !**.sha1 > > The reason it doesn't allow them is that the gitignore documentation > explicitly > states that they're invalid: > > ... I've checked the code and run some tests. There is a twist here. "**" is only special when matched in "pathname" mode. That is when the pattern contains at least one slash. In your patterns above, that only applies to the first pattern. When '**' is special, if it's neither '**/', '/**/' or '/**', it _is_ considered invalid (i.e. bad pattern) and the pattern will not match anything. The confusion comes from when '**' is not special for the remaining three patterns, it's considered as regular '*' and still matches stuff. So, I think we have two options. The document could be clarified with something like this -- 8< -- diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt index d107daaffd..500cd43939 100644 --- a/Documentation/gitignore.txt +++ b/Documentation/gitignore.txt @@ -100,7 +100,8 @@ PATTERN FORMAT a shell glob pattern and checks for a match against the pathname relative to the location of the `.gitignore` file (relative to the toplevel of the work tree if not from a - `.gitignore` file). + `.gitignore` file). Note that the "two consecutive asterisks" rule + below does not apply. - Otherwise, Git treats the pattern as a shell glob: "`*`" matches anything except "`/`", "`?`" matches any one character except "`/`" @@ -129,7 +130,8 @@ full pathname may have special meaning: matches zero or more directories. For example, "`a/**/b`" matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on. - - Other consecutive asterisks are considered invalid. + - Other consecutive asterisks are considered invalid and the pattern + is ignored. NOTES - -- 8< -- Or we could make the behavior consistent. If '**' is invalid, just consider it two separate regular '*'. Then all four of your patterns will behave the same way. The change for that is quite simple -- 8< -- diff --git a/wildmatch.c b/wildmatch.c index d074c1be10..64087bf02c 100644 --- a/wildmatch.c +++ b/wildmatch.c @@ -104,8 +104,10 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags) dowild(p + 1, text, flags) == WM_MATCH) return WM_MATCH; match_slash = 1; - } else - return WM_ABORT_MALFORMED; + } else { + /* without WM_PATHNAME, '*' == '**' */ + match_slash = flags & WM_PATHNAME ? 0 : 1; + } } else /* without WM_PATHNAME, '*' == '**' */ match_slash = flags & WM_PATHNAME ? 0 : 1; -- 8< -- Which way should we go? I'm leaning towards the second one... -- Duy
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On Mon, Oct 15, 2018 at 12:57 AM Junio C Hamano wrote: > > Duy Nguyen writes: > > >> Our matching function comes from rsync originally, whose manpage says: > >> > >> use ’**’ to match anything, including slashes. > >> > >> I believe this is accurate as far as the implementation goes. > > > > No. "**" semantics is not the same as from rsync. The three cases > > "**/", "/**/" and "/**" were requested by Junio if I remember > > correctly. You can search the mail archive for more information. > > Perhaps spelling the rules out would be more benefitial for the > purpose of this thread? I do not recall what I requested, but let OK I gave you too much credit. You pointed out semantics problem with rsync "**" [1] and gently pushed me to implement a safer subset ;-) [1] https://public-inbox.org/git/7vbogj5sji@alter.siamese.dyndns.org/ > me throw out my guesses (i.e. what I would have wished if I were > making a request to implement something) to keep the thread alive, > you can correct me, and people can take it from there to update the > docs ;-) > > A double-asterisk, both of whose ends are adjacent to a > directory boundary (i.e. the beginning of the pattern, the end > of the pattern or a slash) macthes 0 or more levels of > directories. e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but > not z-a/b. a/**/b would match a/b, a/x/b, but not a/z-b or > a-z-b. > > What a double-asterisk that does not sit on a directory boundary, > e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am > concerned, is undefined, meaning that (1) I do not care all that > much what the code actually do when a pattern like that is given as > long as it does not segfault, and (2) I do not think I would mind > changing the behaviour as a "bugfix", if their current behaviour > does not make sense and we can come up with a saner alternative. I think the document describes more or less what you wrote above in the indented paragraph (but maybe less clear, patches are of course welcome). It's the last paragraph that is problematic. It right now says "invalid" which could be interpreted as "bad pattern, rejected" but we in fact accept these "*" are regular ones. There are not many alternatives we could do though. Erroring out could really flood the stderr because we match these patterns zillions of time and traditionally fnmatch gracefully accepts bad patterns, trying to make the best of of them. So keeping undefined "**" as two "*" sounds good enough. But of course I'm open for saner alternatives if people find any. -- Duy
Re: [BUG] gitignore documentation inconsistent with actual behaviour
Duy Nguyen writes: >> Our matching function comes from rsync originally, whose manpage says: >> >> use ’**’ to match anything, including slashes. >> >> I believe this is accurate as far as the implementation goes. > > No. "**" semantics is not the same as from rsync. The three cases > "**/", "/**/" and "/**" were requested by Junio if I remember > correctly. You can search the mail archive for more information. Perhaps spelling the rules out would be more benefitial for the purpose of this thread? I do not recall what I requested, but let me throw out my guesses (i.e. what I would have wished if I were making a request to implement something) to keep the thread alive, you can correct me, and people can take it from there to update the docs ;-) A double-asterisk, both of whose ends are adjacent to a directory boundary (i.e. the beginning of the pattern, the end of the pattern or a slash) macthes 0 or more levels of directories. e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but not z-a/b. a/**/b would match a/b, a/x/b, but not a/z-b or a-z-b. What a double-asterisk that does not sit on a directory boundary, e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am concerned, is undefined, meaning that (1) I do not care all that much what the code actually do when a pattern like that is given as long as it does not segfault, and (2) I do not think I would mind changing the behaviour as a "bugfix", if their current behaviour does not make sense and we can come up with a saner alternative. But the documentation probably should describe what these currently match as the starting point. Thanks.
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On Thu, Oct 11, 2018 at 2:41 PM Ævar Arnfjörð Bjarmason wrote: > > > On Thu, Oct 11 2018, dana wrote: > > > Hello, > > > > I'm a contributor to ripgrep, which is a grep-like tool that supports using > > gitignore files to control which files are searched in a repo (or any other > > directory tree). ripgrep's support for the patterns in these files is based > > on > > git's official documentation, as seen here: > > > > https://git-scm.com/docs/gitignore > > > > One of the most common reports on the ripgrep bug tracker is that it does > > not > > allow patterns like the following real-world examples, where a ** is used > > along > > with other text within the same path component: > > > > **/**$$*.java > > **.orig > > **local.properties > > !**.sha1 > > > > The reason it doesn't allow them is that the gitignore documentation > > explicitly > > states that they're invalid: > > > > A leading "**" followed by a slash means match in all directories... > > > > A trailing "/**" matches everything inside... > > > > A slash followed by two consecutive asterisks then a slash matches zero or > > more directories... > > > > Other consecutive asterisks are considered invalid. Perhaps "undefined" is a better word than "invalid". > > git itself happily accepts these patterns, however, apparently treating the > > ** > > like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, > > it > > matches / as a regular character, thus crossing directory boundaries). > > > > ripgrep's developer is loathe to reverse-engineer this undocumented > > behaviour, > > and so the reports keep coming, both for ripgrep itself and for down-stream > > consumers of it and its ignore crate (including most notably Microsoft's VS > > Code > > editor). > > > > My request: Assuming that git's actual handling of these patterns is > > intended, > > would it be possible to make it 'official' and explicitly add it to the > > documentation? You mean "**" in the fourth case is interpreted as "*"? Yes I guess we could rephrase it as "Other consecutive asterisks are considered normal wildcard asterisks" > Yeah those docs seem wrong. In general the docs for the matching > function are quite bad. I have on my TODO list to factor this out into > some gitwildmatch manpage, but right now the bit in gitignore is all we > have. > > Our matching function comes from rsync originally, whose manpage says: > > use ’**’ to match anything, including slashes. > > I believe this is accurate as far as the implementation goes. No. "**" semantics is not the same as from rsync. The three cases "**/", "/**/" and "/**" were requested by Junio if I remember correctly. You can search the mail archive for more information. -- Duy
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason wrote: >Our matching function comes from rsync originally, whose manpage says: > > use ’**’ to match anything, including slashes. > >I believe this is accurate as far as the implementation goes. You can >also see the rather exhaustive tests here: >https://github.com/git/git/blob/master/t/t3070-wildmatch.sh Thanks. The tests are a little hard to follow at first glance, so i guess i'd have to study them more to see what the semantics actually are. When i tested it briefly it didn't seem like it was behaving as you described, but maybe i wasn't paying close enough attention. In any case, i'm sure these will be useful. On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason wrote: >The reason I'm on this tangent is to ask whether if such a thing >existed, if it's something you can see something like ripgrep >using. I.e. ask git given its .gitignore and .gitattributes what thing >matches one of its pathspecs instead of carrying your own bug-compatible >implementation. My impression is that it probably isn't practical to use something like that as ripgrep's main pattern-matching facility — for one thing there are performance concerns, and for another it makes it dependent on git for some of its core functionality (and it's not a git-specific tool). But i was going to say that it'd be useful for testing, and when i e-mailed Andrew (the main ripgrep dev) about this he said the same. I've CCed him on this message in case he has anything to add. dana
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On Thu, Oct 11 2018, dana wrote: > Hello, > > I'm a contributor to ripgrep, which is a grep-like tool that supports using > gitignore files to control which files are searched in a repo (or any other > directory tree). ripgrep's support for the patterns in these files is based on > git's official documentation, as seen here: > > https://git-scm.com/docs/gitignore > > One of the most common reports on the ripgrep bug tracker is that it does not > allow patterns like the following real-world examples, where a ** is used > along > with other text within the same path component: > > **/**$$*.java > **.orig > **local.properties > !**.sha1 > > The reason it doesn't allow them is that the gitignore documentation > explicitly > states that they're invalid: > > A leading "**" followed by a slash means match in all directories... > > A trailing "/**" matches everything inside... > > A slash followed by two consecutive asterisks then a slash matches zero or > more directories... > > Other consecutive asterisks are considered invalid. > > git itself happily accepts these patterns, however, apparently treating the ** > like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it > matches / as a regular character, thus crossing directory boundaries). > > ripgrep's developer is loathe to reverse-engineer this undocumented behaviour, > and so the reports keep coming, both for ripgrep itself and for down-stream > consumers of it and its ignore crate (including most notably Microsoft's VS > Code > editor). > > My request: Assuming that git's actual handling of these patterns is intended, > would it be possible to make it 'official' and explicitly add it to the > documentation? > > References (the first one is the main bug): > > https://github.com/BurntSushi/ripgrep/issues/373 > https://github.com/BurntSushi/ripgrep/issues/507 > https://github.com/BurntSushi/ripgrep/issues/859 > https://github.com/BurntSushi/ripgrep/issues/945 > https://github.com/BurntSushi/ripgrep/issues/1080 > https://github.com/BurntSushi/ripgrep/issues/1082 > https://github.com/Microsoft/vscode/issues/24050 Yeah those docs seem wrong. In general the docs for the matching function are quite bad. I have on my TODO list to factor this out into some gitwildmatch manpage, but right now the bit in gitignore is all we have. Our matching function comes from rsync originally, whose manpage says: use ’**’ to match anything, including slashes. I believe this is accurate as far as the implementation goes. You can also see the rather exhaustive tests here: https://github.com/git/git/blob/master/t/t3070-wildmatch.sh Note the different behavior with e.g. --glob-pathspecs v.s. the default. There's also stuff like: $ grep diff=perl .gitattributes *.perl eol=lf diff=perl *.pl eof=lf diff=perl *.pm eol=lf diff=perl $ git ls-files ":(attr:diff=perl)" | wc -l 65 And then the exclude syntax. This is not in .gitignore: $ git ls-files ":(exclude)*.pm" ":(attr:diff=perl)" | wc -l 41 $ git ls-files ":^*.pm" ":(attr:diff=perl)" | wc -l 41 I.e. we have wildmatch() on one hand and then the pathspec matching. For an unrelated thing I have been thinking of adding a new plumbing command who'd get input like this on stdin: 1 text t/t0202/test.pl\0\0 2 text perl/Git.pm\0\0 3 text *.pm\0\0 4 text :^*.pm"\0:(attr:diff=perl)\0\0 5 match glob,icase\04\03\0\0 6 match icase\04\02\0\0 7 match \04\01\0\0 Which would return (in any order): 1 OK 2 OK 3 OK 4 OK 5 NO 6 NO 7 YES Or whatever. I.e. something where you can as a batch feed various strings into a program in a batch, and then ask how some of them would match against each other. The reason for this is to extend something like git-multimail[1] with a config where users can subscribe to changes to paths as declared by git pathspecs, and be informed about which of the glob rules they defined matched their config. Right now you'd need to e.g. for a "git push" run each match rule for each user with such config against each changed path in each commit that's pushed, but plumbing like this would allow for feeding arbitrary combinations of those in and ask what matches against what. The reason I'm on this tangent is to ask whether if such a thing existed, if it's something you can see something like ripgrep using. I.e. ask git given its .gitignore and .gitattributes what thing matches one of its pathspecs instead of carrying your own bug-compatible implementation. 1. https://github.com/git-multimail/git-multimail/
Re: [BUG] gitignore documentation inconsistent with actual behaviour
On 11 Oct 2018, at 05:19, dana wrote: >git itself happily accepts these patterns, however, apparently treating the ** >like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it >matches / as a regular character, thus crossing directory boundaries). Sorry for replying to myself so quickly, but i think this bit is wrong. It seems like it actually just treats ** in this sort of pattern like a regular * — but i haven't studied the source very closely, so maybe that's not the full story either, i'm not sure. dana
[BUG] gitignore documentation inconsistent with actual behaviour
Hello, I'm a contributor to ripgrep, which is a grep-like tool that supports using gitignore files to control which files are searched in a repo (or any other directory tree). ripgrep's support for the patterns in these files is based on git's official documentation, as seen here: https://git-scm.com/docs/gitignore One of the most common reports on the ripgrep bug tracker is that it does not allow patterns like the following real-world examples, where a ** is used along with other text within the same path component: **/**$$*.java **.orig **local.properties !**.sha1 The reason it doesn't allow them is that the gitignore documentation explicitly states that they're invalid: A leading "**" followed by a slash means match in all directories... A trailing "/**" matches everything inside... A slash followed by two consecutive asterisks then a slash matches zero or more directories... Other consecutive asterisks are considered invalid. git itself happily accepts these patterns, however, apparently treating the ** like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it matches / as a regular character, thus crossing directory boundaries). ripgrep's developer is loathe to reverse-engineer this undocumented behaviour, and so the reports keep coming, both for ripgrep itself and for down-stream consumers of it and its ignore crate (including most notably Microsoft's VS Code editor). My request: Assuming that git's actual handling of these patterns is intended, would it be possible to make it 'official' and explicitly add it to the documentation? References (the first one is the main bug): https://github.com/BurntSushi/ripgrep/issues/373 https://github.com/BurntSushi/ripgrep/issues/507 https://github.com/BurntSushi/ripgrep/issues/859 https://github.com/BurntSushi/ripgrep/issues/945 https://github.com/BurntSushi/ripgrep/issues/1080 https://github.com/BurntSushi/ripgrep/issues/1082 https://github.com/Microsoft/vscode/issues/24050 dana