Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-26 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> To be clear the point of my mail was not to say "I can't think of a
> way to support both of these things, help!", obviously we can continue
> to maintain two codepaths. The point was to raise the idea that we
> could simply remove the more complex & doomed to forever be slow
> codepath.

To be clear, the point of my response was that these features must
remain.  As long as they are more convenient than sifting through
output produced by pattern matching engine that is less powerful
(which forces the user to give wider pattern than desired, to avoid
false negatives) with eyeball, having to match each pattern one by
one, instead of being able to use a combined and more efficient
single pattern, is still more efficient for the end user, which is
the point of using a computer.


Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-26 Thread Ævar Arnfjörð Bjarmason
On Fri, May 26, 2017 at 2:58 AM, Junio C Hamano  wrote:
> Ævar Arnfjörð Bjarmason  writes:
>
>> I think it's a pointless distraction to start speculating in this
>> commit message what we're going to do with --debug it if it ever
>> starts emitting some debugging information at pattern execution time.
>
> OK.
>
>> As an aside, I'd very much like to remove both --debug and the
>> --and/--or/--all-match, gives some very rough edges in the UI and how
>> easy it is to make that feature error or segfault, I suspect you might
>> be the only one using it.
>
> I agree that rewriting "grep -e A -e B" to "grep -e A|B" as an
> optimization is an interesting possibility to look into, and I can
> understand that having to support "--and" and "--not" would
> make such an optimization harder to implement. "-e A --and -e B"
> must become "-e A.*B|B.*A" and as you get more terms your unified
> pattern will grow combinatorial, at which point you would be better
> off matching N patterns and combining the result.
>
> Ever saw a user run "ps | grep rogue | grep -v grep" to find a rogue
> process to kill?  That would not work if the rogue process's command
> line has a word "grep".  Because "git grep" is often run on files in
> order to find the location the patterns appear in, "git grep -e
> pattern | grep -v unwanted" shares the same issue--the unwanted
> pattern may appear in the filename, and the downstream "grep -v" may
> filter out a valid hit.  This is why "--not" exists [*1*].  I agree
> that emulating it within the same "concatenate patterns into one"
> optimization you are envisioning may be hard.
>
> Attempting to optimize "--all-match" would share similar difficulty
> with "--and", but your matching now must be done with the entire
> buffer and not go line-by-line.  It was meant to make it possible to
> say "find commits that avarab@ talks about both regex and log", i.e.
>
> $ git log --author=avarab@ --all-match --grep=log --grep=regex
>
> This is not something you can emulate by piping an output of grep to
> another grep.
>
> But none of the above means you have to give up optimizing.
>
> You can choose not to combine them into a single pattern if certain
> constructions are hard, and do only the easy ones.  If you think
> that harder combinations are not used very often, the result would
> be faster for many cases while not losing useful features, which is
> what we want.

To be clear the point of my mail was not to say "I can't think of a
way to support both of these things, help!", obviously we can continue
to maintain two codepaths. The point was to raise the idea that we
could simply remove the more complex & doomed to forever be slow
codepath.

Obviously there are caveats with the likes of "grep foo | grep bar"
that don't exist with "grep -e foo --and -e bar". I'm less interested
in whether we can come up with cases that wouldn't be possible if this
were removed, than if anyone's using them in practice.

I suspect that to the extent anyone uses this for common things it
could be emulated by --single-line --perl-regexp and e.g. 'foo.*bar'
instead of 'foo' --and 'bar'. I.e. we could offer to AND together your
regexes and match them over the entire content.

If someone needed something more complex we could just show an example
of piping e.g. \0-delimited commit messages into an arbitrary perl
script you provide.

Anyway, I've only looked this over a tiny bit, and I don't know
whether it's worth it to remove this, right now I was just interested
in some reports of what it was used for. I.e. whether anyone uses it
for N-level deep mixed AND/OR branches, or whether it's really just a
lazy way to concat regexes and get around the current limitation of
not being able to match across lines.

> [Footnote]
>
> *1* For human consumption, lack of "--not" may not hurt in the sense
> that there are workarounds (i.e. you can do without "| grep -v
> unwanted" and filter irrelevant ones by eyeballing).  But it is
> essential while scripting and trying to be precise.


Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-25 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> I think it's a pointless distraction to start speculating in this
> commit message what we're going to do with --debug it if it ever
> starts emitting some debugging information at pattern execution time.

OK.

> As an aside, I'd very much like to remove both --debug and the
> --and/--or/--all-match, gives some very rough edges in the UI and how
> easy it is to make that feature error or segfault, I suspect you might
> be the only one using it.

I agree that rewriting "grep -e A -e B" to "grep -e A|B" as an
optimization is an interesting possibility to look into, and I can
understand that having to support "--and" and "--not" would
make such an optimization harder to implement. "-e A --and -e B"
must become "-e A.*B|B.*A" and as you get more terms your unified
pattern will grow combinatorial, at which point you would be better
off matching N patterns and combining the result.

Ever saw a user run "ps | grep rogue | grep -v grep" to find a rogue
process to kill?  That would not work if the rogue process's command
line has a word "grep".  Because "git grep" is often run on files in
order to find the location the patterns appear in, "git grep -e
pattern | grep -v unwanted" shares the same issue--the unwanted
pattern may appear in the filename, and the downstream "grep -v" may
filter out a valid hit.  This is why "--not" exists [*1*].  I agree
that emulating it within the same "concatenate patterns into one"
optimization you are envisioning may be hard.

Attempting to optimize "--all-match" would share similar difficulty
with "--and", but your matching now must be done with the entire
buffer and not go line-by-line.  It was meant to make it possible to
say "find commits that avarab@ talks about both regex and log", i.e.

$ git log --author=avarab@ --all-match --grep=log --grep=regex

This is not something you can emulate by piping an output of grep to
another grep.

But none of the above means you have to give up optimizing.  

You can choose not to combine them into a single pattern if certain
constructions are hard, and do only the easy ones.  If you think
that harder combinations are not used very often, the result would
be faster for many cases while not losing useful features, which is
what we want.


[Footnote]

*1* For human consumption, lack of "--not" may not hurt in the sense
that there are workarounds (i.e. you can do without "| grep -v
unwanted" and filter irrelevant ones by eyeballing).  But it is
essential while scripting and trying to be precise.


Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-25 Thread Ævar Arnfjörð Bjarmason
On Wed, May 24, 2017 at 6:42 AM, Junio C Hamano  wrote:
> Ævar Arnfjörð Bjarmason   writes:
>
>> Rather, it's just to make the code easier to reason about. It's
>> confusing to debug this under threading & non-threading when the
>> threading codepaths redundantly compile a pattern which is never used.
>>
>> The reason the patterns are recompiled is as a side-effect of
>> duplicating the whole grep_opt structure, which is not thread safe,
>> writable, and munged during execution. The grep_opt structure then
>> points to the grep_pat structure where pattern or patterns are stored.
>>
>> I looked into e.g. splitting the API into some "do & alloc threadsafe
>> stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the
>> execution time of grep_opt_dup() & pattern compilation is trivial
>> compared to actually executing the grep, so there was no point. Even
>> with the more expensive JIT changes to follow the most expensive PCRE
>> patterns take something like 0.0X milliseconds to compile at most[1].
>
> OK.
>
>> The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach
>> --debug option to dump the parse tree", 2012-09-13) still works
>> properly with this change. It only emits debugging info during pattern
>> compilation, which is now dumped by the pattern compiled just before
>> the first thread is started.
>
> When opt is passed to run(), opt->debug is still true for the first
> worker thread.  As long as opt->debug never makes difference after
> compile_grep_patterns(opt) returns, I think the change in this patch
> safe.

Right, the --debug feature only impacts pattern compilation.

> I do not know if we want to rely on it, but we can explain it
> away by saying "we'll only debug the runtime behaviour for the first
> worker only", or something, so it is not a big deal either way.

I think it's a pointless distraction to start speculating in this
commit message what we're going to do with --debug it if it ever
starts emitting some debugging information at pattern execution time.

As an aside, I'd very much like to remove both --debug and the
--and/--or/--all-match, gives some very rough edges in the UI and how
easy it is to make that feature error or segfault, I suspect you might
be the only one using it.

There are pattern matching optimizations I'd like to do that are much
more of a pain with that feature around. It's easy to AND multiple
regexes together into one match via -e, but when you have to deal with
negation and arbitrarily complex chained & parenthesized  AND/OR you
end up having to run your custom state machine on every line with
multiple regex matches per line.

The system grep doesn't have this feature, and people seem to do
without it. The motivation for it isn't explained in commit 79d3696cfb
("git-grep: boolean expression on pattern matching.", 2006-06-30), but
I suspect it's a hack around not being able to do "git grep ... | git
grep -v ...", which is how you'd do "I'd like to match this, but not
that" with the system grep.

Just supporting that would be much easier than supporting the and/or
matching machinery.


Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-23 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason   writes:

> Rather, it's just to make the code easier to reason about. It's
> confusing to debug this under threading & non-threading when the
> threading codepaths redundantly compile a pattern which is never used.
>
> The reason the patterns are recompiled is as a side-effect of
> duplicating the whole grep_opt structure, which is not thread safe,
> writable, and munged during execution. The grep_opt structure then
> points to the grep_pat structure where pattern or patterns are stored.
>
> I looked into e.g. splitting the API into some "do & alloc threadsafe
> stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the
> execution time of grep_opt_dup() & pattern compilation is trivial
> compared to actually executing the grep, so there was no point. Even
> with the more expensive JIT changes to follow the most expensive PCRE
> patterns take something like 0.0X milliseconds to compile at most[1].

OK.

> The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach
> --debug option to dump the parse tree", 2012-09-13) still works
> properly with this change. It only emits debugging info during pattern
> compilation, which is now dumped by the pattern compiled just before
> the first thread is started.

When opt is passed to run(), opt->debug is still true for the first
worker thread.  As long as opt->debug never makes difference after
compile_grep_patterns(opt) returns, I think the change in this patch
safe.  I do not know if we want to rely on it, but we can explain it
away by saying "we'll only debug the runtime behaviour for the first
worker only", or something, so it is not a big deal either way.

Thanks.


[PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-23 Thread Ævar Arnfjörð Bjarmason
Change the pattern compilation logic under threading so that grep
doesn't compile a pattern it never ends up using on the non-threaded
code path, only to compile it again N times for N threads which will
each use their own copy, ignoring the initially compiled pattern.

This redundant compilation dates back to the initial introduction of
the threaded grep in commit 5b594f457a ("Threaded grep",
2010-01-25).

There was never any reason for doing this redundant work other than an
oversight in the initial commit. Jeff King suggested on-list in
<20170414212325.fefrl3qdjigwy...@sigill.intra.peff.net> that this
might be needed to check the pattern for sanity before threaded
execution commences.

That's not the case. The pattern is compiled under threading in
start_threads() before any concurrent execution has started by calling
pthread_create(), so if the pattern contains an error we still do the
right thing. I.e. die with one error before any threaded execution has
commenced, instead of e.g. spewing out an error for each N threads,
which could be a regression a change like this might inadvertently
introduce.

This change is not meant as an optimization, any performance gains
from this are in the hundreds to thousands of nanoseconds at most. If
we wanted more performance here we could just re-use the compiled
patterns in multiple threads (regcomp(3) is thread-safe), or partially
re-use them and the associated structures in the case of later PCRE
JIT changes.

Rather, it's just to make the code easier to reason about. It's
confusing to debug this under threading & non-threading when the
threading codepaths redundantly compile a pattern which is never used.

The reason the patterns are recompiled is as a side-effect of
duplicating the whole grep_opt structure, which is not thread safe,
writable, and munged during execution. The grep_opt structure then
points to the grep_pat structure where pattern or patterns are stored.

I looked into e.g. splitting the API into some "do & alloc threadsafe
stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the
execution time of grep_opt_dup() & pattern compilation is trivial
compared to actually executing the grep, so there was no point. Even
with the more expensive JIT changes to follow the most expensive PCRE
patterns take something like 0.0X milliseconds to compile at most[1].

The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach
--debug option to dump the parse tree", 2012-09-13) still works
properly with this change. It only emits debugging info during pattern
compilation, which is now dumped by the pattern compiled just before
the first thread is started.

1. http://sljit.sourceforge.net/pcre.html

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 builtin/grep.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index b1095362fb..12e62fcbf3 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -224,7 +224,8 @@ static void start_threads(struct grep_opt *opt)
int err;
struct grep_opt *o = grep_opt_dup(opt);
o->output = strbuf_out;
-   o->debug = 0;
+   if (i)
+   o->debug = 0;
compile_grep_patterns(o);
err = pthread_create(&threads[i], NULL, run, o);
 
@@ -1167,8 +1168,6 @@ int cmd_grep(int argc, const char **argv, const char 
*prefix)
if (!opt.fixed && opt.ignore_case)
opt.regflags |= REG_ICASE;
 
-   compile_grep_patterns(&opt);
-
/*
 * We have to find "--" in a separate pass, because its presence
 * influences how we will parse arguments that come before it.
@@ -1245,6 +1244,15 @@ int cmd_grep(int argc, const char **argv, const char 
*prefix)
num_threads = 0;
 #endif
 
+   if (!num_threads)
+   /*
+* The compiled patterns on the main path are only
+* used when not using threading. Otherwise
+* start_threads() below calls compile_grep_patterns()
+* for each thread.
+*/
+   compile_grep_patterns(&opt);
+
 #ifndef NO_PTHREADS
if (num_threads) {
if (!(opt.name_only || opt.unmatch_name_only || opt.count)
-- 
2.13.0.303.g4ebf302169



[PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

2017-05-13 Thread Ævar Arnfjörð Bjarmason
Change the pattern compilation logic under threading so that grep
doesn't compile a pattern it never ends up using on the non-threaded
code path, only to compile it again N times for N threads which will
each use their own copy, ignoring the initially compiled pattern.

This redundant compilation dates back to the initial introduction of
the threaded grep in commit 5b594f457a ("Threaded grep",
2010-01-25).

There was never any reason for doing this redundant work other than an
oversight in the initial commit. Jeff King suggested on-list in
<20170414212325.fefrl3qdjigwy...@sigill.intra.peff.net> that this
might be needed to check the pattern for sanity before threaded
execution commences.

That's not the case. The pattern is compiled under threading in
start_threads() before any concurrent execution has started by calling
pthread_create(), so if the pattern contains an error we still do the
right thing. I.e. die with one error before any threaded execution has
commenced, instead of e.g. spewing out an error for each N threads,
which could be a regression a change like this might inadvertently
introduce.

This change is not meant as an optimization, any performance gains
from this are in the hundreds to thousands of nanoseconds at most. If
we wanted more performance here we could just re-use the compiled
patterns in multiple threads (regcomp(3) is thread-safe), or partially
re-use them and the associated structures in the case of later PCRE
JIT changes.

Rather, it's just to make the code easier to reason about. It's
confusing to debug this under threading & non-threading when the
threading codepaths redundantly compile a pattern which is never used.

The reason the patterns are recompiled is as a side-effect of
duplicating the whole grep_opt structure, which is not thread safe,
writable, and munged during execution. The grep_opt structure then
points to the grep_pat structure where pattern or patterns are stored.

I looked into e.g. splitting the API into some "do & alloc threadsafe
stuff", "spawn thread", "do and alloc non-threadsafe stuff", but the
execution time of grep_opt_dup() & pattern compilation is trivial
compared to actually executing the grep, so there was no point. Even
with the more expensive JIT changes to follow the most expensive PCRE
patterns take something like 0.0X milliseconds to compile at most[1].

The undocumented --debug mode added in commit 17bf35a3c7 ("grep: teach
--debug option to dump the parse tree", 2012-09-13) still works
properly with this change. It only emits debugging info during pattern
compilation, which is now dumped by the pattern compiled just before
the first thread is started.

1. http://sljit.sourceforge.net/pcre.html

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 builtin/grep.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index b1095362fb..12e62fcbf3 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -224,7 +224,8 @@ static void start_threads(struct grep_opt *opt)
int err;
struct grep_opt *o = grep_opt_dup(opt);
o->output = strbuf_out;
-   o->debug = 0;
+   if (i)
+   o->debug = 0;
compile_grep_patterns(o);
err = pthread_create(&threads[i], NULL, run, o);
 
@@ -1167,8 +1168,6 @@ int cmd_grep(int argc, const char **argv, const char 
*prefix)
if (!opt.fixed && opt.ignore_case)
opt.regflags |= REG_ICASE;
 
-   compile_grep_patterns(&opt);
-
/*
 * We have to find "--" in a separate pass, because its presence
 * influences how we will parse arguments that come before it.
@@ -1245,6 +1244,15 @@ int cmd_grep(int argc, const char **argv, const char 
*prefix)
num_threads = 0;
 #endif
 
+   if (!num_threads)
+   /*
+* The compiled patterns on the main path are only
+* used when not using threading. Otherwise
+* start_threads() below calls compile_grep_patterns()
+* for each thread.
+*/
+   compile_grep_patterns(&opt);
+
 #ifndef NO_PTHREADS
if (num_threads) {
if (!(opt.name_only || opt.unmatch_name_only || opt.count)
-- 
2.11.0