Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading
Ævar Arnfjörð Bjarmason writes: > To be clear the point of my mail was not to say "I can't think of a > way to support both of these things, help!", obviously we can continue > to maintain two codepaths. The point was to raise the idea that we > could simply remove the more complex & doomed to forever be slow > codepath. To be clear, the point of my response was that these features must remain. As long as they are more convenient than sifting through output produced by pattern matching engine that is less powerful (which forces the user to give wider pattern than desired, to avoid false negatives) with eyeball, having to match each pattern one by one, instead of being able to use a combined and more efficient single pattern, is still more efficient for the end user, which is the point of using a computer.
Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading
On Fri, May 26, 2017 at 2:58 AM, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> I think it's a pointless distraction to start speculating in this >> commit message what we're going to do with --debug it if it ever >> starts emitting some debugging information at pattern execution time. > > OK. > >> As an aside, I'd very much like to remove both --debug and the >> --and/--or/--all-match, gives some very rough edges in the UI and how >> easy it is to make that feature error or segfault, I suspect you might >> be the only one using it. > > I agree that rewriting "grep -e A -e B" to "grep -e A|B" as an > optimization is an interesting possibility to look into, and I can > understand that having to support "--and" and "--not" would > make such an optimization harder to implement. "-e A --and -e B" > must become "-e A.*B|B.*A" and as you get more terms your unified > pattern will grow combinatorial, at which point you would be better > off matching N patterns and combining the result. > > Ever saw a user run "ps | grep rogue | grep -v grep" to find a rogue > process to kill? That would not work if the rogue process's command > line has a word "grep". Because "git grep" is often run on files in > order to find the location the patterns appear in, "git grep -e > pattern | grep -v unwanted" shares the same issue--the unwanted > pattern may appear in the filename, and the downstream "grep -v" may > filter out a valid hit. This is why "--not" exists [*1*]. I agree > that emulating it within the same "concatenate patterns into one" > optimization you are envisioning may be hard. > > Attempting to optimize "--all-match" would share similar difficulty > with "--and", but your matching now must be done with the entire > buffer and not go line-by-line. It was meant to make it possible to > say "find commits that avarab@ talks about both regex and log", i.e. > > $ git log --author=avarab@ --all-match --grep=log --grep=regex > > This is not something you can emulate by piping an output of grep to > another grep. > > But none of the above means you have to give up optimizing. > > You can choose not to combine them into a single pattern if certain > constructions are hard, and do only the easy ones. If you think > that harder combinations are not used very often, the result would > be faster for many cases while not losing useful features, which is > what we want. To be clear the point of my mail was not to say "I can't think of a way to support both of these things, help!", obviously we can continue to maintain two codepaths. The point was to raise the idea that we could simply remove the more complex & doomed to forever be slow codepath. Obviously there are caveats with the likes of "grep foo | grep bar" that don't exist with "grep -e foo --and -e bar". I'm less interested in whether we can come up with cases that wouldn't be possible if this were removed, than if anyone's using them in practice. I suspect that to the extent anyone uses this for common things it could be emulated by --single-line --perl-regexp and e.g. 'foo.*bar' instead of 'foo' --and 'bar'. I.e. we could offer to AND together your regexes and match them over the entire content. If someone needed something more complex we could just show an example of piping e.g. \0-delimited commit messages into an arbitrary perl script you provide. Anyway, I've only looked this over a tiny bit, and I don't know whether it's worth it to remove this, right now I was just interested in some reports of what it was used for. I.e. whether anyone uses it for N-level deep mixed AND/OR branches, or whether it's really just a lazy way to concat regexes and get around the current limitation of not being able to match across lines. > [Footnote] > > *1* For human consumption, lack of "--not" may not hurt in the sense > that there are workarounds (i.e. you can do without "| grep -v > unwanted" and filter irrelevant ones by eyeballing). But it is > essential while scripting and trying to be precise.
Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading
Ævar Arnfjörð Bjarmason writes: > I think it's a pointless distraction to start speculating in this > commit message what we're going to do with --debug it if it ever > starts emitting some debugging information at pattern execution time. OK. > As an aside, I'd very much like to remove both --debug and the > --and/--or/--all-match, gives some very rough edges in the UI and how > easy it is to make that feature error or segfault, I suspect you might > be the only one using it. I agree that rewriting "grep -e A -e B" to "grep -e A|B" as an optimization is an interesting possibility to look into, and I can understand that having to support "--and" and "--not" would make such an optimization harder to implement. "-e A --and -e B" must become "-e A.*B|B.*A" and as you get more terms your unified pattern will grow combinatorial, at which point you would be better off matching N patterns and combining the result. Ever saw a user run "ps | grep rogue | grep -v grep" to find a rogue process to kill? That would not work if the rogue process's command line has a word "grep". Because "git grep" is often run on files in order to find the location the patterns appear in, "git grep -e pattern | grep -v unwanted" shares the same issue--the unwanted pattern may appear in the filename, and the downstream "grep -v" may filter out a valid hit. This is why "--not" exists [*1*]. I agree that emulating it within the same "concatenate patterns into one" optimization you are envisioning may be hard. Attempting to optimize "--all-match" would share similar difficulty with "--and", but your matching now must be done with the entire buffer and not go line-by-line. It was meant to make it possible to say "find commits that avarab@ talks about both regex and log", i.e. $ git log --author=avarab@ --all-match --grep=log --grep=regex This is not something you can emulate by piping an output of grep to another grep. But none of the above means you have to give up optimizing. You can choose not to combine them into a single pattern if certain constructions are hard, and do only the easy ones. If you think that harder combinations are not used very often, the result would be faster for many cases while not losing useful features, which is what we want. [Footnote] *1* For human consumption, lack of "--not" may not hurt in the sense that there are workarounds (i.e. you can do without "| grep -v unwanted" and filter irrelevant ones by eyeballing). But it is essential while scripting and trying to be precise.
Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading
On Wed, May 24, 2017 at 6:42 AM, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> Rather, it's just to make the code easier to reason about. It's >> confusing to debug this under threading & non-threading when the >> threading codepaths redundantly compile a pattern which is never used. >> >> The reason the patterns are recompiled is as a side-effect of >> duplicating the whole grep_opt structure, which is not thread safe, >> writable, and munged during execution. The grep_opt structure then >> points to the grep_pat structure where pattern or patterns are stored. >> >> I looked into e.g. splitting the API into some "do & alloc threadsafe >> stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the >> execution time of grep_opt_dup() & pattern compilation is trivial >> compared to actually executing the grep, so there was no point. Even >> with the more expensive JIT changes to follow the most expensive PCRE >> patterns take something like 0.0X milliseconds to compile at most[1]. > > OK. > >> The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach >> --debug option to dump the parse tree", 2012-09-13) still works >> properly with this change. It only emits debugging info during pattern >> compilation, which is now dumped by the pattern compiled just before >> the first thread is started. > > When opt is passed to run(), opt->debug is still true for the first > worker thread. As long as opt->debug never makes difference after > compile_grep_patterns(opt) returns, I think the change in this patch > safe. Right, the --debug feature only impacts pattern compilation. > I do not know if we want to rely on it, but we can explain it > away by saying "we'll only debug the runtime behaviour for the first > worker only", or something, so it is not a big deal either way. I think it's a pointless distraction to start speculating in this commit message what we're going to do with --debug it if it ever starts emitting some debugging information at pattern execution time. As an aside, I'd very much like to remove both --debug and the --and/--or/--all-match, gives some very rough edges in the UI and how easy it is to make that feature error or segfault, I suspect you might be the only one using it. There are pattern matching optimizations I'd like to do that are much more of a pain with that feature around. It's easy to AND multiple regexes together into one match via -e, but when you have to deal with negation and arbitrarily complex chained & parenthesized AND/OR you end up having to run your custom state machine on every line with multiple regex matches per line. The system grep doesn't have this feature, and people seem to do without it. The motivation for it isn't explained in commit 79d3696cfb ("git-grep: boolean expression on pattern matching.", 2006-06-30), but I suspect it's a hack around not being able to do "git grep ... | git grep -v ...", which is how you'd do "I'd like to match this, but not that" with the system grep. Just supporting that would be much easier than supporting the and/or matching machinery.
Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading
Ævar Arnfjörð Bjarmason writes: > Rather, it's just to make the code easier to reason about. It's > confusing to debug this under threading & non-threading when the > threading codepaths redundantly compile a pattern which is never used. > > The reason the patterns are recompiled is as a side-effect of > duplicating the whole grep_opt structure, which is not thread safe, > writable, and munged during execution. The grep_opt structure then > points to the grep_pat structure where pattern or patterns are stored. > > I looked into e.g. splitting the API into some "do & alloc threadsafe > stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the > execution time of grep_opt_dup() & pattern compilation is trivial > compared to actually executing the grep, so there was no point. Even > with the more expensive JIT changes to follow the most expensive PCRE > patterns take something like 0.0X milliseconds to compile at most[1]. OK. > The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach > --debug option to dump the parse tree", 2012-09-13) still works > properly with this change. It only emits debugging info during pattern > compilation, which is now dumped by the pattern compiled just before > the first thread is started. When opt is passed to run(), opt->debug is still true for the first worker thread. As long as opt->debug never makes difference after compile_grep_patterns(opt) returns, I think the change in this patch safe. I do not know if we want to rely on it, but we can explain it away by saying "we'll only debug the runtime behaviour for the first worker only", or something, so it is not a big deal either way. Thanks.