Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-20 Thread dana
On 20 Oct 2018, at 01:03, Duy Nguyen  wrote:
>foo**bar would match foobar as well as foo/bar, foo/x/bar and
>foo/x/y/bar... Its behavior is error prone in my opinion. There's also
>some concerns in early iterations of this "**" support that we would
>need to revisit if we want 'rsync' behavior. I'm not very excited
>about doing that.

That's fair.

I guess another point in favour of your second option is that it's essentially
the same behaviour used by bash (with the `globstar` option) and zsh (with the
default options); they also give `**` special recursion powers only when used in
a path component by itself, otherwise it acts like `*`. So there's precedent
there.

dana



Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-20 Thread Duy Nguyen
On Sat, Oct 20, 2018 at 7:53 AM dana  wrote:
>
> On 20 Oct 2018, at 00:26, Duy Nguyen  wrote:
> >Which way should we go? I'm leaning towards the second one...
>
> Not sure how much my opinion is worth, but the second option does feel more
> friendly (from a usage perspective) as well as more straight-forward (from a
> re-implementation perspective).

Yeah. And not having to describe all the corner cases is a plus. Too
many corner cases are a sign of bad implementation anyway. I'll wait
some more time for the others to speak up before I cook a proper
patch.

> There's a third option too, though, right? The 'rsync' behaviour mentioned
> earlier? It wouldn't matter either way in any of the examples i listed, but is
> there ever a conceivable use case for something like `foo**bar`, where the 
> `**`
> matches across slashes? (I can't think of any reason i'd need that personally,
> but then again i don't understand why these people are using `**` the way they
> are in the first place.)

foo**bar would match foobar as well as foo/bar, foo/x/bar and
foo/x/y/bar... Its behavior is error prone in my opinion. There's also
some concerns in early iterations of this "**" support that we would
need to revisit if we want 'rsync' behavior. I'm not very excited
about doing that.
-- 
Duy


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-19 Thread dana
On 20 Oct 2018, at 00:26, Duy Nguyen  wrote:
>Which way should we go? I'm leaning towards the second one...

Not sure how much my opinion is worth, but the second option does feel more
friendly (from a usage perspective) as well as more straight-forward (from a
re-implementation perspective).

There's a third option too, though, right? The 'rsync' behaviour mentioned
earlier? It wouldn't matter either way in any of the examples i listed, but is
there ever a conceivable use case for something like `foo**bar`, where the `**`
matches across slashes? (I can't think of any reason i'd need that personally,
but then again i don't understand why these people are using `**` the way they
are in the first place.)

dana



Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-19 Thread Duy Nguyen
On Thu, Oct 11, 2018 at 05:19:06AM -0500, dana wrote:
> Hello,
> 
> I'm a contributor to ripgrep, which is a grep-like tool that supports using
> gitignore files to control which files are searched in a repo (or any other
> directory tree). ripgrep's support for the patterns in these files is based on
> git's official documentation, as seen here:
> 
>   https://git-scm.com/docs/gitignore
> 
> One of the most common reports on the ripgrep bug tracker is that it does not
> allow patterns like the following real-world examples, where a ** is used 
> along
> with other text within the same path component:
> 
>   **/**$$*.java
>   **.orig
>   **local.properties
>   !**.sha1
> 
> The reason it doesn't allow them is that the gitignore documentation 
> explicitly
> states that they're invalid:
>
> ...

I've checked the code and run some tests. There is a twist here. "**"
is only special when matched in "pathname" mode. That is when the
pattern contains at least one slash. In your patterns above, that only
applies to the first pattern.

When '**' is special, if it's neither '**/', '/**/' or '/**', it _is_
considered invalid (i.e. bad pattern) and the pattern will not match
anything.

The confusion comes from when '**' is not special for the remaining
three patterns, it's considered as regular '*' and still matches
stuff.

So, I think we have two options. The document could be clarified with
something like this

-- 8< --
diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index d107daaffd..500cd43939 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -100,7 +100,8 @@ PATTERN FORMAT
a shell glob pattern and checks for a match against the
pathname relative to the location of the `.gitignore` file
(relative to the toplevel of the work tree if not from a
-   `.gitignore` file).
+   `.gitignore` file). Note that the "two consecutive asterisks" rule
+   below does not apply.
 
  - Otherwise, Git treats the pattern as a shell glob: "`*`" matches
anything except "`/`", "`?`" matches any one character except "`/`"
@@ -129,7 +130,8 @@ full pathname may have special meaning:
matches zero or more directories. For example, "`a/**/b`"
matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
 
- - Other consecutive asterisks are considered invalid.
+ - Other consecutive asterisks are considered invalid and the pattern
+   is ignored.
 
 NOTES
 -
-- 8< --

Or we could make the behavior consistent. If '**' is invalid, just
consider it two separate regular '*'. Then all four of your patterns
will behave the same way. The change for that is quite simple

-- 8< --
diff --git a/wildmatch.c b/wildmatch.c
index d074c1be10..64087bf02c 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -104,8 +104,10 @@ static int dowild(const uchar *p, const uchar *text, 
unsigned int flags)
dowild(p + 1, text, flags) == 
WM_MATCH)
return WM_MATCH;
match_slash = 1;
-   } else
-   return WM_ABORT_MALFORMED;
+   } else {
+   /* without WM_PATHNAME, '*' == '**' */
+   match_slash = flags & WM_PATHNAME ? 0 : 
1;
+   }
} else
/* without WM_PATHNAME, '*' == '**' */
match_slash = flags & WM_PATHNAME ? 0 : 1;
-- 8< --

Which way should we go? I'm leaning towards the second one...
--
Duy


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-15 Thread Duy Nguyen
On Mon, Oct 15, 2018 at 12:57 AM Junio C Hamano  wrote:
>
> Duy Nguyen  writes:
>
> >> Our matching function comes from rsync originally, whose manpage says:
> >>
> >> use ’**’ to match anything, including slashes.
> >>
> >> I believe this is accurate as far as the implementation goes.
> >
> > No. "**" semantics is not the same as from rsync. The three cases
> > "**/", "/**/" and "/**" were requested by Junio if I remember
> > correctly. You can search the mail archive for more information.
>
> Perhaps spelling the rules out would be more benefitial for the
> purpose of this thread?  I do not recall what I requested, but let

OK I gave you too much credit. You pointed out semantics problem with
rsync "**" [1] and gently pushed me to implement a safer subset ;-)

[1] https://public-inbox.org/git/7vbogj5sji@alter.siamese.dyndns.org/

> me throw out my guesses (i.e. what I would have wished if I were
> making a request to implement something) to keep the thread alive,
> you can correct me, and people can take it from there to update the
> docs ;-)
>
> A double-asterisk, both of whose ends are adjacent to a
> directory boundary (i.e. the beginning of the pattern, the end
> of the pattern or a slash) macthes 0 or more levels of
> directories.  e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but
> not z-a/b.  a/**/b would match a/b, a/x/b, but not a/z-b or
> a-z-b.
>
> What a double-asterisk that does not sit on a directory boundary,
> e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am
> concerned, is undefined, meaning that (1) I do not care all that
> much what the code actually do when a pattern like that is given as
> long as it does not segfault, and (2) I do not think I would mind
> changing the behaviour as a "bugfix", if their current behaviour
> does not make sense and we can come up with a saner alternative.

I think the document describes more or less what you wrote above in
the indented paragraph (but maybe less clear, patches are of course
welcome). It's the last paragraph that is problematic. It right now
says "invalid" which could be interpreted as "bad pattern, rejected"
but we in fact accept these "*" are regular ones.

There are not many alternatives we could do though. Erroring out could
really flood the stderr because we match these patterns zillions of
time and traditionally fnmatch gracefully accepts bad patterns, trying
to make the best of of them. So keeping undefined "**" as two "*"
sounds good enough.

But of course I'm open for saner alternatives if people find any.
-- 
Duy


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-14 Thread Junio C Hamano
Duy Nguyen  writes:

>> Our matching function comes from rsync originally, whose manpage says:
>>
>> use ’**’ to match anything, including slashes.
>>
>> I believe this is accurate as far as the implementation goes.
>
> No. "**" semantics is not the same as from rsync. The three cases
> "**/", "/**/" and "/**" were requested by Junio if I remember
> correctly. You can search the mail archive for more information.

Perhaps spelling the rules out would be more benefitial for the
purpose of this thread?  I do not recall what I requested, but let
me throw out my guesses (i.e. what I would have wished if I were
making a request to implement something) to keep the thread alive,
you can correct me, and people can take it from there to update the
docs ;-)

A double-asterisk, both of whose ends are adjacent to a
directory boundary (i.e. the beginning of the pattern, the end
of the pattern or a slash) macthes 0 or more levels of
directories.  e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but
not z-a/b.  a/**/b would match a/b, a/x/b, but not a/z-b or
a-z-b.

What a double-asterisk that does not sit on a directory boundary,
e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am
concerned, is undefined, meaning that (1) I do not care all that
much what the code actually do when a pattern like that is given as
long as it does not segfault, and (2) I do not think I would mind
changing the behaviour as a "bugfix", if their current behaviour
does not make sense and we can come up with a saner alternative.

But the documentation probably should describe what these currently
match as the starting point.

Thanks.


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-14 Thread Duy Nguyen
On Thu, Oct 11, 2018 at 2:41 PM Ævar Arnfjörð Bjarmason
 wrote:
>
>
> On Thu, Oct 11 2018, dana wrote:
>
> > Hello,
> >
> > I'm a contributor to ripgrep, which is a grep-like tool that supports using
> > gitignore files to control which files are searched in a repo (or any other
> > directory tree). ripgrep's support for the patterns in these files is based 
> > on
> > git's official documentation, as seen here:
> >
> >   https://git-scm.com/docs/gitignore
> >
> > One of the most common reports on the ripgrep bug tracker is that it does 
> > not
> > allow patterns like the following real-world examples, where a ** is used 
> > along
> > with other text within the same path component:
> >
> >   **/**$$*.java
> >   **.orig
> >   **local.properties
> >   !**.sha1
> >
> > The reason it doesn't allow them is that the gitignore documentation 
> > explicitly
> > states that they're invalid:
> >
> >   A leading "**" followed by a slash means match in all directories...
> >
> >   A trailing "/**" matches everything inside...
> >
> >   A slash followed by two consecutive asterisks then a slash matches zero or
> >   more directories...
> >
> >   Other consecutive asterisks are considered invalid.

Perhaps "undefined" is a better word than "invalid".

> > git itself happily accepts these patterns, however, apparently treating the 
> > **
> > like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, 
> > it
> > matches / as a regular character, thus crossing directory boundaries).
> >
> > ripgrep's developer is loathe to reverse-engineer this undocumented 
> > behaviour,
> > and so the reports keep coming, both for ripgrep itself and for down-stream
> > consumers of it and its ignore crate (including most notably Microsoft's VS 
> > Code
> > editor).
> >
> > My request: Assuming that git's actual handling of these patterns is 
> > intended,
> > would it be possible to make it 'official' and explicitly add it to the
> > documentation?

You mean "**" in the fourth case is interpreted as "*"? Yes I guess we
could rephrase it as "Other consecutive asterisks are considered
normal wildcard asterisks"

> Yeah those docs seem wrong. In general the docs for the matching
> function are quite bad. I have on my TODO list to factor this out into
> some gitwildmatch manpage, but right now the bit in gitignore is all we
> have.
>
> Our matching function comes from rsync originally, whose manpage says:
>
> use ’**’ to match anything, including slashes.
>
> I believe this is accurate as far as the implementation goes.

No. "**" semantics is not the same as from rsync. The three cases
"**/", "/**/" and "/**" were requested by Junio if I remember
correctly. You can search the mail archive for more information.
-- 
Duy


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-13 Thread dana
On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason  wrote:
>Our matching function comes from rsync originally, whose manpage says:
>
>   use ’**’ to match anything, including slashes.
>
>I believe this is accurate as far as the implementation goes. You can
>also see the rather exhaustive tests here:
>https://github.com/git/git/blob/master/t/t3070-wildmatch.sh

Thanks. The tests are a little hard to follow at first glance, so i guess i'd
have to study them more to see what the semantics actually are. When i tested it
briefly it didn't seem like it was behaving as you described, but maybe i wasn't
paying close enough attention. In any case, i'm sure these will be useful.

On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason  wrote:
>The reason I'm on this tangent is to ask whether if such a thing
>existed, if it's something you can see something like ripgrep
>using. I.e. ask git given its .gitignore and .gitattributes what thing
>matches one of its pathspecs instead of carrying your own bug-compatible
>implementation.

My impression is that it probably isn't practical to use something like that as
ripgrep's main pattern-matching facility — for one thing there are performance
concerns, and for another it makes it dependent on git for some of its core
functionality (and it's not a git-specific tool). But i was going to say that
it'd be useful for testing, and when i e-mailed Andrew (the main ripgrep dev)
about this he said the same. I've CCed him on this message in case he has
anything to add.

dana



Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-11 Thread Ævar Arnfjörð Bjarmason


On Thu, Oct 11 2018, dana wrote:

> Hello,
>
> I'm a contributor to ripgrep, which is a grep-like tool that supports using
> gitignore files to control which files are searched in a repo (or any other
> directory tree). ripgrep's support for the patterns in these files is based on
> git's official documentation, as seen here:
>
>   https://git-scm.com/docs/gitignore
>
> One of the most common reports on the ripgrep bug tracker is that it does not
> allow patterns like the following real-world examples, where a ** is used 
> along
> with other text within the same path component:
>
>   **/**$$*.java
>   **.orig
>   **local.properties
>   !**.sha1
>
> The reason it doesn't allow them is that the gitignore documentation 
> explicitly
> states that they're invalid:
>
>   A leading "**" followed by a slash means match in all directories...
>
>   A trailing "/**" matches everything inside...
>
>   A slash followed by two consecutive asterisks then a slash matches zero or
>   more directories...
>
>   Other consecutive asterisks are considered invalid.
>
> git itself happily accepts these patterns, however, apparently treating the **
> like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
> matches / as a regular character, thus crossing directory boundaries).
>
> ripgrep's developer is loathe to reverse-engineer this undocumented behaviour,
> and so the reports keep coming, both for ripgrep itself and for down-stream
> consumers of it and its ignore crate (including most notably Microsoft's VS 
> Code
> editor).
>
> My request: Assuming that git's actual handling of these patterns is intended,
> would it be possible to make it 'official' and explicitly add it to the
> documentation?
>
> References (the first one is the main bug):
>
> https://github.com/BurntSushi/ripgrep/issues/373
> https://github.com/BurntSushi/ripgrep/issues/507
> https://github.com/BurntSushi/ripgrep/issues/859
> https://github.com/BurntSushi/ripgrep/issues/945
> https://github.com/BurntSushi/ripgrep/issues/1080
> https://github.com/BurntSushi/ripgrep/issues/1082
> https://github.com/Microsoft/vscode/issues/24050

Yeah those docs seem wrong. In general the docs for the matching
function are quite bad. I have on my TODO list to factor this out into
some gitwildmatch manpage, but right now the bit in gitignore is all we
have.

Our matching function comes from rsync originally, whose manpage says:

use ’**’ to match anything, including slashes.

I believe this is accurate as far as the implementation goes. You can
also see the rather exhaustive tests here:
https://github.com/git/git/blob/master/t/t3070-wildmatch.sh

Note the different behavior with e.g. --glob-pathspecs v.s. the
default. There's also stuff like:

$ grep diff=perl .gitattributes
*.perl eol=lf diff=perl
*.pl eof=lf diff=perl
*.pm eol=lf diff=perl
$ git ls-files ":(attr:diff=perl)" | wc -l
65

And then the exclude syntax. This is not in .gitignore:

$ git ls-files ":(exclude)*.pm" ":(attr:diff=perl)" | wc -l
41
$ git ls-files ":^*.pm" ":(attr:diff=perl)" | wc -l
41

I.e. we have wildmatch() on one hand and then the pathspec matching.

For an unrelated thing I have been thinking of adding a new plumbing
command who'd get input like this on stdin:

1 text t/t0202/test.pl\0\0
2 text perl/Git.pm\0\0
3 text *.pm\0\0
4 text :^*.pm"\0:(attr:diff=perl)\0\0
5 match glob,icase\04\03\0\0
6 match icase\04\02\0\0
7 match \04\01\0\0

Which would return (in any order):

1 OK
2 OK
3 OK
4 OK
5 NO
6 NO
7 YES

Or whatever. I.e. something where you can as a batch feed various
strings into a program in a batch, and then ask how some of them would
match against each other.

The reason for this is to extend something like git-multimail[1] with a
config where users can subscribe to changes to paths as declared by git
pathspecs, and be informed about which of the glob rules they defined
matched their config.

Right now you'd need to e.g. for a "git push" run each match rule for
each user with such config against each changed path in each commit
that's pushed, but plumbing like this would allow for feeding arbitrary
combinations of those in and ask what matches against what.

The reason I'm on this tangent is to ask whether if such a thing
existed, if it's something you can see something like ripgrep
using. I.e. ask git given its .gitignore and .gitattributes what thing
matches one of its pathspecs instead of carrying your own bug-compatible
implementation.

1. https://github.com/git-multimail/git-multimail/


Re: [BUG] gitignore documentation inconsistent with actual behaviour

2018-10-11 Thread dana
On 11 Oct 2018, at 05:19, dana  wrote:
>git itself happily accepts these patterns, however, apparently treating the **
>like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
>matches / as a regular character, thus crossing directory boundaries).

Sorry for replying to myself so quickly, but i think this bit is wrong. It seems
like it actually just treats ** in this sort of pattern like a regular * — but
i haven't studied the source very closely, so maybe that's not the full story
either, i'm not sure.

dana



[BUG] gitignore documentation inconsistent with actual behaviour

2018-10-11 Thread dana
Hello,

I'm a contributor to ripgrep, which is a grep-like tool that supports using
gitignore files to control which files are searched in a repo (or any other
directory tree). ripgrep's support for the patterns in these files is based on
git's official documentation, as seen here:

  https://git-scm.com/docs/gitignore

One of the most common reports on the ripgrep bug tracker is that it does not
allow patterns like the following real-world examples, where a ** is used along
with other text within the same path component:

  **/**$$*.java
  **.orig
  **local.properties
  !**.sha1

The reason it doesn't allow them is that the gitignore documentation explicitly
states that they're invalid:

  A leading "**" followed by a slash means match in all directories...

  A trailing "/**" matches everything inside...

  A slash followed by two consecutive asterisks then a slash matches zero or
  more directories...

  Other consecutive asterisks are considered invalid.

git itself happily accepts these patterns, however, apparently treating the **
like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
matches / as a regular character, thus crossing directory boundaries).

ripgrep's developer is loathe to reverse-engineer this undocumented behaviour,
and so the reports keep coming, both for ripgrep itself and for down-stream
consumers of it and its ignore crate (including most notably Microsoft's VS Code
editor).

My request: Assuming that git's actual handling of these patterns is intended,
would it be possible to make it 'official' and explicitly add it to the
documentation?

References (the first one is the main bug):

https://github.com/BurntSushi/ripgrep/issues/373
https://github.com/BurntSushi/ripgrep/issues/507
https://github.com/BurntSushi/ripgrep/issues/859
https://github.com/BurntSushi/ripgrep/issues/945
https://github.com/BurntSushi/ripgrep/issues/1080
https://github.com/BurntSushi/ripgrep/issues/1082
https://github.com/Microsoft/vscode/issues/24050

dana