from:"Jim Meyering"

bug#64773: grep 3.11 -r on 100000+ files fails with "Operation not supported"

2023-07-21 Thread Jim Meyering

On Fri, Jul 21, 2023 at 4:38 PM Paul Eggert  wrote:
> To fix just this bug (as opposed to the other Gnulib-related bugs that
> may be lurking) try applying the attached Gnulib patch to a grep 3.11
> tarball.
>
> Closing the debbugs.gnu.org bug report, as the bug has been fixed upstream.

Thanks for reporting that.
I've just pushed the following, adding a NEWS entry for the 3.11 bug and a test.
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=v3.11-12-gd1c3fbe

bug#64235: patch: Ensure that makeinfo ≥ 6.8 checks the @menu structure

2023-06-24 Thread Jim Meyering

On Thu, Jun 22, 2023 at 4:28 PM Bruno Haible  wrote:
> Makeinfo versions < 6.7 detected mistakes in the @menu structure of TeXinfo
> input. Makeinfo versions ≥ 6.8 don't do this any more by default. They need
> an extra option, for this validation to happen. See
> .
>
> Since doc/grep.texi has a hand-written @menu, it should use this extra
> option.
>
> Here's a patch to that effect.
>
> There are two possibilities for adding the extra option:
>   - in the MAKEINFO variable, where it has an effect on both "makeinfo"
> and "makeinfo --html",
>   - in the AM_MAKEINFOFLAGS variable, where it has an effect on "makeinfo"
> only.
> Since some maintainers may check their documentation edits only by
> regenerating the HTML-formatted documentation, the first choice is
> preferable.
>
> Tested by running
>   touch doc/grep.texi ; (cd doc && make grep.info V=1)
> and
>   make sc_makefile_at_at_check

Thank you. Pushed.

bug#63780: Reversing the grep message output type matching binary files (without the -a option added) changed from stdout to stderr

2023-05-29 Thread Jim Meyering

tags 63780 notabug
close 63780

On Sun, May 28, 2023 at 9:56 PM 2773414454 via Bug reports for GNU
grep  wrote:
> Between grep 3.4 and grep 3.5, the message that grep prints when a binary
> file matches (without the -a option) moved from stdout to stderr. As a
> result, the message can no longer be piped, which significantly changes
> the user experience. But this modification doesn't actually do much.
> Could you consider reversing this change?

Please read this excerpt from the NEWS and announcement for some of
the motivation for that change:

* Noteworthy changes in release 3.5 (2020-09-27) [stable]

** Changes in behavior

  The message that a binary file matches is now sent to standard error
  and the message has been reworded from "Binary file FOO matches" to
  "grep: FOO: binary file matches", to avoid confusion with ordinary
  output or when file names contain spaces and the like, and to be
  more consistent with other diagnostics.  For example, commands
  like 'grep PATTERN FILE | wc' no longer add 1 to the count of
  matching text lines due to the presence of the message.  Like other
  stderr messages, the message is now omitted if the --no-messages
  (-s) option is given.

If you want to restore such diagnostics to stdout, you can invoke grep
through a bash/zsh function wrapper like this:
(it preserves all stderr, except that one diagnostic, which it
redirects to stdout):

  grep() { local re='^grep: .*: binary file matches$'; env grep "$@" \
    2> >(tee >(env grep -av "$re" 1>&2) | env grep -a "$re"); }
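
A blunter alternative, when it is acceptable to lose the distinction
between stdout and stderr for one invocation, is simply to merge the two
streams. The sketch below is only an illustration: it assumes GNU grep
3.5 or newer, and the temporary file is made up for the demonstration.

```shell
# Illustrative only: create a small file containing a NUL byte so that
# grep treats it as binary.
tmp=$(mktemp -d)
printf 'foo\0bar\n' > "$tmp/bin.dat"

# With grep >= 3.5 the "binary file matches" diagnostic goes to stderr,
# so nothing reaches the pipe here.
stdout_lines=$(grep foo "$tmp/bin.dat" 2>/dev/null | wc -l)

# Merging stderr into stdout sends the diagnostic (and every other
# error message) down the pipe again.
merged_lines=$(grep foo "$tmp/bin.dat" 2>&1 | wc -l)

echo "stdout only: $stdout_lines, stderr merged: $merged_lines"
rm -rf "$tmp"
```

With grep 3.5 or newer this should report no lines in the first case and
one in the second. Unlike the wrapper, it also sends genuine error
messages down the pipe.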

bug#63533: test-mbrlen5.sh failure

2023-05-18 Thread Jim Meyering

On Thu, May 18, 2023 at 2:44 PM Carlo Marcelo Arenas Belón
 wrote:
> On Wed, May 17, 2023 at 09:09:02PM -0400, Caleb Zulawski wrote:
> >
> > Isn’t this test too strict, then?
>
> It shouldn't have been included with the release; the attached patch
> should help with that for future releases.

Thanks for looking into this, Carlo.

However, I think we should not limit the gnulib tests that are run via
grep just to accommodate systems with broken code, even when grep
happens not to exercise the broken code path.

bug#63533: test-mbrlen5.sh failure

2023-05-17 Thread Jim Meyering

tags 63533 notabug
close 63533
done

On Tue, May 16, 2023 at 1:34 PM Carlo Arenas  wrote:
> That is a test for a bug that your system image has but that is not
> relevant to grep (mbrlen doesn't correctly handle a call with a len of
> 0).

Thanks for responding.
This reply (via the above) marks this issue as a non-bug and closes it.

bug#63419: Spelling patch

2023-05-10 Thread Jim Meyering

On Wed, May 10, 2023 at 3:49 PM Josh Soref  wrote:
> Some projects seem very attached to historical typos, I haven't
> checked carefully to determine if this is such a project. Do note that
> `Alain Magloire`'s is among the things that was historically
> misspelled (see 2001-02-17).
>
> There are a couple of doubled words and their existence makes it much
> harder to understand what the text is trying to say. AFAICT, they're
> purely errant.

Thanks. I've applied that in your name
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=16f9ca8ed1826063920f480ec341f20b0313482e

I added a tiny additional fix to your bootstrap change and propagated
that into gnulib:
https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=dfdf33a46655eea91ce0a7db5821cb99dd985c05

bug#62983: workaround PCRE2 bug affecting at least \D and \W

2023-04-30 Thread Jim Meyering

On Sun, Apr 30, 2023 at 3:15 AM Paul Eggert  wrote:
>
> On 2023-04-28 23:54, Jim Meyering wrote:
> > I've made some small adjustments and tidied up the ChangeLog in the
> > attached.
>
> One question about that patch (both original and as revised). Why do we
> need a new binary_safe slot in struct pcre_comp? Shouldn't the
> binary_safe stuff be done at compile-time rather than run-time?
>
> Proposed revised patch attached. It also tweaks commentary slightly, and
> uses a more uniform style in the test comments (something like what
> Carlo suggested, but a bit wordier since it names the characters).

Thanks, Paul. I prefer that. Pushed.
I've also pushed an update to use the latest from gnulib.

bug#63146: Document that -f takes "-" on Info and man page?

2023-04-29 Thread Jim Meyering

On Fri, Apr 28, 2023 at 3:26 PM Sebastian Carlos  wrote:
> Currently the documentation for the -f flag in both man page and info
> are not very clear about the possibility of passing "-" to read from stdin.

Thanks for pointing that out.
I've addressed that with this just-pushed commit:


grep-doc-f-FILE-stdin.diff
Description: Binary data

bug#62983: workaround PCRE2 bug affecting at least \D and \W

2023-04-29 Thread Jim Meyering

On Fri, Apr 21, 2023 at 10:22 PM Carlo Marcelo Arenas Belón
 wrote:
> On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> > On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > > its JIT implementation that results in failure to match for the negative
> > > perl classes, and seems to be easier to replicate when the matching
> > > character is a multibyte one.
> >
> > Unfortunately that is a little vague. I expect the issue is not limited to
> > \D and \W, as there are other ways to specify negative Perl classes.
>
> Correct, it should also affect at least \S, but I haven't been able to
> trigger it there.
>
> The bug was that an uninitialized value was being used in the JIT code that
> supports the PCRE2_MATCH_INVALID_UTF mode, which is why I said "randomly" in
> the commit message.
>
> If you want to be strict, how about the attached patch instead?
>
> > And if
> > the bug merely seems to be easier to replicate with multibyte characters, it
> > sounds like we may have issues even when matching ASCII characters in a
> > UTF-8 locale.
>
> Which the current workaround addresses, since you need both PCRE2_JIT and
> PCRE2_MATCH_INVALID_UTF to trigger it, and the subject encoding is irrelevant
> for the logic to decide if PCRE2_MATCH_INVALID_UTF gets enabled or not.
>
> > Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should
> > focus our optimization efforts on future PCRE2 versions, and not worry about
> > optimizing earlier versions where optimizations complicate maintenance for a
> > declining benefit, and are likely to provoke bugs in older versions that as
> > time passes will be harder to debug.
>
> Not sure I understand your concern here, but if it is about disabling JIT
> instead, then the possibility of introducing bugs is even bigger since it
> affects all versions of PCRE2 (not only 10.34 or newer).
>
> > > Alternatively JIT could be disabled instead, but the option selected has
> > > less of an impact on performance.
> >
> > Disabling JIT sounds better, as correctness trumps performance. Until the
> > bug is fixed (or at least better-understood so that we have a workaround we
> > can trust), how about the attached patch instead?
>
> The bug has been fixed already, and will be included in the next release.
> There might be additional changes as spelled in that discussion, and indeed
> the change to the proposed solution proactively helps with one of those.
>
> It is very unlikely, but some systems might include non 0 values on the
> tables for characters over 127 and that might trigger a similar problem that
> is yet to be fixed.
>
> Carlo
>
> [1] https://github.com/PCRE2Project/pcre2/commit/2c08b619dc973beacc474dcb67cda8cd366200ce

Thanks, Carlo.
I've made some small adjustments and tidied up the ChangeLog in the attached.
Hope to push it by Sunday.

There's enough going on via gnulib that I'll likely make yet another
snapshot with the very latest.

Also, there remain solaris sparc and i386 gnulib test failures:


https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-sparc/builds/336
  FAIL: test-c-stack.sh
  FAIL: test-year2038


https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-i386/builds/334
  FAIL: test-year2038


grep-pcre2.diff
Description: Binary data

bug#63016: make it easier to build with development versions of PCRE2

2023-04-29 Thread Jim Meyering

On Sat, Apr 22, 2023 at 6:08 PM Carlo Marcelo Arenas Belón
 wrote:
> Building against a different version of PCRE2 than the one that is provided
> with the system is complicated by the fact that unlike what is advertised,
> if a pkg-config module for libpcre2-8 is found, it will override the values
> that were provided with PCRE_CFLAGS and PCRE_LIBS.

Thank you. Pushed.

bug#63030: correction

2023-04-24 Thread Jim Meyering

tags 63030 notabug
done

On Sun, Apr 23, 2023 at 5:05 AM Sebastian Carlos  wrote:
> On second thought, I think the documentation is fine. I just misread it. It
> has two sentences: "Skip any COMMAND-LINE file with..." and "When searching
> recursively, skip any SUBFILE whose...". So both cases are considered and
> it doesn't imply that --exclude acts on command line files only.

Thanks for the follow-up.
This email closes the issue in our tracker.

bug#62983: workaround PCRE2 bug affecting at least \D and \W

2023-04-20 Thread Jim Meyering

On Thu, Apr 20, 2023 at 7:33 PM Jim Meyering  wrote:
>
> On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
>  wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > its JIT implementation that results in failure to match for the negative
> > perl classes, and seems to be easier to replicate when the matching
> > character is a multibyte one.
> >
> > Disable that flag and use the original fallback instead.
> >
> > Alternatively JIT could be disabled instead, but the option selected has
> > less of an impact on performance.
>
> Thanks for the patch! Is there any PCRE-upstream discussion about this?
> If so, I'd like to reference that from your commit log.

Oh! I see it in the test file:
  https://github.com/PCRE2Project/pcre2/issues/224

bug#62983: workaround PCRE2 bug affecting at least \D and \W

2023-04-20 Thread Jim Meyering

On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
 wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> its JIT implementation that results in failure to match for the negative
> perl classes, and seems to be easier to replicate when the matching
> character is a multibyte one.
>
> Disable that flag and use the original fallback instead.
>
> Alternatively JIT could be disabled instead, but the option selected has
> less of an impact on performance.

Thanks for the patch! Is there any PCRE-upstream discussion about this?
If so, I'd like to reference that from your commit log.

bug#60690: -P '\d' in GNU and git grep

2023-04-05 Thread Jim Meyering

On Wed, Apr 5, 2023 at 11:33 AM Paul Eggert  wrote:
> On 2023-04-04 12:31, Junio C Hamano wrote:
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.
>
> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

Good argument for making PCRE2_UCP the default.

> Neither git grep -P nor the current release of pcre2grep -u have \d
> matching non-ASCII digits, because they do not use PCRE2_UCP. However,
> in a February 8 commit[1], Philip Hazel changed pcre2grep to use
> PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU
> grep -P did (though 3.10 has changed this).
>
> That February commit also added a --no-ucp option, to disable PCRE2_UCP.
> So as I understand it, if you're in a UTF-8 locale:
>
> * 10.43 pcre2grep -u will behave like 3.9 GNU grep -P.
>
> * 10.43 pcre2grep -u --no-ucp will behave like git grep -P.
>
> * Current GNU grep -P is different from everybody else.
>
> This incompatibility is not good.
>
> Here are two ways forward to fix this incompatibility (there are other
> possibilities of course):
>
> (A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
> --no-ucp, and git grep -P follows suit. That is, both GNU and git grep
> act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
> enable PCRE2_UCP unless --no-ucp is given. This would cause \d to match
> non-ASCII digits unless --no-ucp is given.
>
> (B) GNU grep -P and git grep -P mimic pcre2grep in both -u and --no-ucp.
> That is, they would both do 8-bit-only by default, and use PCRE2_UTF
> only when -u or --utf is given, and use PCRE2_UCP only when --no-ucp is
> absent. This would cause \d to match non-ASCII digits only when -u is
> given but --no-ucp is not.

Changing grep -P's \d to match multibyte digits by default would break
an important contract. Avoiding that feels like it must outweigh any
cross-tool portability concern.

(C)  preserve grep -P's tradition of \d matching only 0..9, and once
grep uses 10.43 or newer, \b and \w will also work as desired.

> Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git grep
> -P would be consistent.

I hope git grep -P's \d will also stick to ASCII-only by default.
Those rare few who desire multibyte matches can always specify \p{Nd}
instead of \d, or (with new enough PCRE2), use (?-aD) and (?aD) to
toggle the digit-matching mode.
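
To make the \p{Nd} suggestion concrete, here is a small sketch comparing
an explicit [0-9] class with \p{Nd}. It assumes a grep built with PCRE2
support and an installed en_US.UTF-8 locale; if either is missing, the
-P probe fails or the second count may differ, so no output is claimed
here.

```shell
digits='٠١٢٣٤٥٦٧٨٩'
loc=en_US.UTF-8   # assumed to be installed on this system

# Probe for -P support first; some builds of grep omit PCRE.
if printf '1\n' | LC_ALL=$loc grep -Pq '1' 2>/dev/null; then
  # [0-9] never matches these characters, whatever \d is taken to mean.
  # (grep -c exits 1 when the count is 0, hence the "|| true".)
  ascii_count=$(printf '%s\n' "$digits" | LC_ALL=$loc grep -Pc '^[0-9]+$' || true)
  # \p{Nd} asks explicitly for any Unicode decimal digit.
  unicode_count=$(printf '%s\n' "$digits" | LC_ALL=$loc grep -Pc '^\p{Nd}+$' || true)
  printf '[0-9] lines: %s, \\p{Nd} lines: %s\n' "$ascii_count" "$unicode_count"
else
  echo "this grep was built without -P support"
fi
```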

bug#60690: -P '\d' in GNU and git grep

2023-04-04 Thread Jim Meyering

On Mon, Apr 3, 2023 at 11:47 PM Paul Eggert  wrote:
> On 2023-04-03 20:30, Jim Meyering wrote:
> > have you seen justification
> > (other than for compatibility with some other tool or language) for
> > allowing \d to match non-ASCII by default, in spite of the risks?
>
> In the example Ævar supplied in <https://bugs.gnu.org/60690>, my
> impression was that it was better when \d matched non-ASCII digits. That
> is, in a UTF-8 locale it's better when \d finds matches in these lines:
>
> >>  > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> >>  > git-gui/po/ja.po:"- 第２行: 空白\n"
>
> because they contain the Japanese digits "１" and "２". This was the only
> example I recall being given.

Before it was unintentionally enabled in grep-3.9, lines like that have
never been matched by grep -P's '\d'. By relaxing \d, we'd weaken
any application that uses say grep -P '^\d+$' to perform input
validation intending to ensure that some input is all ASCII digits.
It's not a big stretch to imagine that some downstream processor
of that "verified" data is not prepared to deal with multi-byte digits.
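
For scripts that need such validation to stay ASCII-only regardless of
how \d ends up behaving, one option is to avoid \d altogether. A minimal
sketch, assuming single-line input; the helper name is_ascii_digits is
hypothetical:

```shell
# Hypothetical helper: succeed only if $1 is a non-empty, single-line
# string made up entirely of the ASCII digits 0-9.
is_ascii_digits() {
  # LC_ALL=C makes the bracket range byte-based, and -x anchors the
  # pattern to the whole line, so no ^...$ is needed.
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[0-9]+'
}

is_ascii_digits 12345 && echo "12345: accepted"
is_ascii_digits '١٢٣' || echo "Arabic-Indic digits: rejected"
is_ascii_digits '' || echo "empty string: rejected"
```

This does not use -P at all, so it behaves the same with any grep
release.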

> Also, I find it odd that grep -P '^[\w\d]*$' matches lines containing
> any sort of Arabic word characters, but it rejects lines containing
> Arabic digits like "٣" that are perfectly reasonable in Arabic-language
> text. I also find it odd that [\d] and [[:digit:]] mean different things.
>
> There are arguments on the other side, otherwise we wouldn't be having
> this discussion. And it's true that grep -P '\d' formerly rejected
> Arabic digits (though it's also true that grep -P '\w' formerly rejected
> Arabic letters...). Still, the cure's oddness and incompatibility with
> Git, Perl, etc. appears to me to be worse than the disease of dealing
> with grep -P invocations that need to use [0-9] or LC_ALL="C" anyway if
> they want to be portable to any program other than GNU grep.

I'm primarily concerned about not introducing a persistent regression in
how GNU grep's -P '\d' works in multibyte locales. The corner cases you
mention do matter, of course, but are far less likely to matter in practice.

bug#60690: -P '\d' in GNU and git grep

2023-04-03 Thread Jim Meyering

On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert  wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that.
It's clearly a good idea to disable PCRE2_UCP for those using those
older, known-buggy versions of pcre2.

The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg4.html
> [6]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]:
> https://lore.kernel.org/git/7e83daa1-f9a9-4151-8d07-d80ea6d59...@clumio.com/
> [8]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links. However, have you seen justification
(other than for compatibility with some other tool or language) for
allowing \d to match non-ASCII by default, in spite of the risks?
IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.

bug#62647: [INSTALL] grep: re-fix Y2038 bug on glibc 2.34+ x86, ARM

2023-04-03 Thread Jim Meyering

On Mon, Apr 3, 2023 at 11:20 AM Paul Eggert  wrote:
> On 2023-04-03 10:52, Jim Meyering wrote:
> > I wanted to see how this would make grep fail, but don't
> > have convenient access to such hosts. Would this trigger the failure?
> >
> >   touch -t 203901010000 f
> >   grep ^ f
>
> Yes, that triggers it. Of course one needs a "touch" and a filesystem
> that supports such timestamps.
>
> Come to think of it, the year2038 module (which coreutils also employs)
> no longer defaults to requiring year2038 support like it used to. It now
> merely enables year2038 support if available. Should we change this in
> Gnulib? This would affect coreutils, grep etc.
>
> Gnulib year2038 became milder when there was pushback about
> AC_SYS_YEAR2038 when it got added to Autoconf. The next Autoconf will
> have AC_SYS_YEAR2038 (which merely tries to get Y2038 support) and
> AC_SYS_YEAR2038_REQUIRED (which requires it).
>
> It's a controversial area because these two modules can change library
> ABIs. I suppose in theory we could add Gnulib modules largefile-required
> and year2038-required, and have coreutils, grep, etc. use these modules.
> However, this doesn't seem worth the hassle, since packages using the
> largefile and year2038 modules are typically compiled with their default
> options. So I'm sort of leaning toward modifying Gnulib's largefile and
> year2038 modules to use the _REQUIRED variants.
>
> Thoughts?

I followed that autoconf discussion and am all for requiring y2038
support in the tools we tend.

bug#62647: [INSTALL] grep: re-fix Y2038 bug on glibc 2.34+ x86, ARM

2023-04-03 Thread Jim Meyering

On Mon, Apr 3, 2023 at 10:34 AM Paul Eggert  wrote:
> The meaning of AC_SYS_LARGEFILE has changed to no longer even try
> to use wider time_t if available.  So use AC_SYS_YEAR2038 as well.
> A more-aggressive change would be to use the next Autoconf’s
> AC_SYS_YEAR2038_REQUIRED but at least let’s restore the grep 3.8
> behavior.
> * NEWS: Mention this.
> * bootstrap.conf: Add year2038.
> ---
>  NEWS           | 4 ++++
>  bootstrap.conf | 1 +
>  2 files changed, 5 insertions(+)
>
> diff --git a/NEWS b/NEWS
> index 6ebade3..060e938 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -10,6 +10,10 @@ GNU grep NEWS                                    -*- outline -*-
>grep 3.8, in that patterns like \w and \b use ASCII rather than
>Unicode interpretations.
>
> +  grep no longer fails on files dated after the year 2038,
> +  when running on 32-bit x86 and ARM hosts using glibc 2.34+.
> +  [bug introduced in grep 3.9]
> +
>
>  * Noteworthy changes in release 3.10 (2023-03-22) [stable]
>
> diff --git a/bootstrap.conf b/bootstrap.conf
> index 50948a6..ec48c37 100644
> --- a/bootstrap.conf
> +++ b/bootstrap.conf
> @@ -102,6 +102,7 @@ windows-stat-inodes
>  xalloc
>  xbinary-io
>  xstrtoimax
> +year2038
>  '

Thanks, Paul.
I wanted to see how this would make grep fail, but don't
have convenient access to such hosts. Would this trigger the failure?

  touch -t 203901010000 f
  grep ^ f

How does it fail?
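
A slightly more defensive version of that check, as a sketch: it uses
the full [[CC]YY]MMDDhhmm form that POSIX touch -t expects, puts a line
in the file so that a working grep exits 0, and reports when the
timestamp cannot be set at all. The temporary directory and the messages
are illustrative.

```shell
tmp=$(mktemp -d)
echo hello > "$tmp/f"

# 203901010000 is 2039-01-01 00:00 in touch -t's [[CC]YY]MMDDhhmm form.
if touch -t 203901010000 "$tmp/f" 2>/dev/null; then
  status=0
  grep ^ "$tmp/f" || status=$?
  # grep exits 0 on a match and 2 on error (for example, a failed stat).
  echo "grep exit status: $status"
else
  status=skipped
  echo "cannot set a post-2038 timestamp with this touch or file system"
fi
rm -rf "$tmp"
```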

bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate

2023-04-02 Thread Jim Meyering

On Sun, Apr 2, 2023 at 1:25 PM Carlo Arenas  wrote:
> On Sun, Apr 2, 2023 at 11:30 AM Paul Eggert  wrote:
> >
> > Also, GNU grep -w passes the following more-complicated regexp to dfaparse:
>
> but AFAIK `-w` is not necessary to trigger it, as the following also
> infloops in Fedora Rawhide
>
>   $ echo a | grep -E '((()|a)|())+'

FYI, this prints its input line (and no infloop) when grep is
configured --with-included-regex, so at least for that one, it may be
due to a recent change in upstream glibc.

bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate

2023-04-01 Thread Jim Meyering

On Mon, Mar 27, 2023 at 6:15 AM Koen Claessen  wrote:
> Running the command:
>
>   echo a | grep -E -w '((()|a)|())*'
>
> does not terminate, and uses a LOT of processor time, for all versions of
> grep I have tried.
>
> This is the smallest case that could be found; simplifying anything in the
> input and/or expression leads to correct behavior.

Thank you! How did you find that?

FYI, this strikes grep-3.10 (on Fedora 37/glibc-2.36-9.fc37.x86_64)
when using LC_ALL=en_US.UTF-8, but not with LC_ALL=C.
I.e., this infloops:
  echo a | LC_ALL=en_US.UTF-8 grep -E -w '((()|a)|())*'

but this works as expected and promptly prints its line of input:
  echo a | LC_ALL=C grep -E -w '((()|a)|())*'

For now, I've added an expected-failing test case for this bug:

grep-glibc-infloop.patch
Description: Binary data
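
When reproducing a hang like this, it helps to bound the run time. A
sketch, assuming the timeout utility from GNU coreutils is available;
the 5-second limit is arbitrary, and if en_US.UTF-8 is not installed the
second run falls back to the C locale:

```shell
# timeout exits with status 124 if it had to kill the command.
for loc in C en_US.UTF-8; do
  status=0
  echo a | LC_ALL=$loc timeout 5 grep -E -w '((()|a)|())*' >/dev/null 2>&1 ||
    status=$?
  case $status in
    0)   echo "$loc: matched promptly" ;;
    124) echo "$loc: timed out (likely the infinite loop)" ;;
    *)   echo "$loc: grep exit status $status" ;;
  esac
done
```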

bug#62272: Erroneous claim in grep man page

2023-03-19 Thread Jim Meyering

tags notabug 62272
stop

On Sun, Mar 19, 2023 at 5:47 AM David Kra  wrote:
> This is a request and further endorsement of a report from 2008, ending
> with https://lists.gnu.org/archive/html/bug-grep/2008-08/msg2.html
>
> Request: Either add four words to the manpage or delete the entire
> sentence.
> ASIS: "In GNU grep there is no difference in available functionality
> between basic and extended syntaxes."
>
> TOBE: "Although the syntaxes differ, in GNU grep there is no difference in
> available functionality between basic and extended syntaxes."
>
> Reasoning: The user should not suffer for not realizing that "no difference
> in available functionality" does not imply "no difference in syntax."

Thanks, but that sentence already says there are two different
syntaxes: basic and extended.
Those are different names, which usually implies they denote different things.
So adding those four words at the beginning of the sentence would be
unnecessary and repetitive.

I'm marking this as done, but discussion may continue.

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-19 Thread Jim Meyering

On Sun, Mar 19, 2023 at 4:12 PM Paul Eggert  wrote:
> On 2023-03-19 13:44, Jim Meyering wrote:
> > I've pushed your change along with the attached.
> > I'll probably create another snapshot today.
>
> Thanks. I also installed a minor dfa.c change in Gnulib yesterday to
> pacify Oracle Solaris Studio. No big deal since 'grep' builds OK anyway.
>
> I also ran into a weird issue with test-select on Fedora 37 x86-64. It
> appears to be timing dependent and usually doesn't happen. I can't
> reproduce under strace. This is another Gnulib thing and not relevant to
> grep (other than people might report test failures to bug-grep).
>
> I installed into Gnulib the attached patch which shouldn't hurt but
> which I don't know fixes the bug.

Oh! I must have missed getting the latter by bare minutes.
I've just published another snapshot, which includes the dfa.c change
but not the select one. We'll get it for the release of 3.10.

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-19 Thread Jim Meyering

On Sun, Mar 19, 2023 at 9:54 AM Jim Meyering  wrote:
> On Sun, Mar 19, 2023 at 1:55 AM Paul Eggert  wrote:
> >
> > On 2023-03-19 01:28, Paul Eggert wrote:
> > > Looking at the source code again, how about if we move the PCRE-specific
> > > changes from src/grep.c to src/pcresearch.c which is where it really
> > > belongs, and more importantly use the bleeding-edge
> > > PCRE2_EXTRA_ASCII_BSD macro if available?
> > >
> > > Something like the attached patch, say. This patch doesn't take your \D
> > > fixes (or the above suggestions) into account.
> >
> > Oops, that patch assumed match_lines. Also, it covered two topics in the
> > doc fix. I installed the obvious topic in the doc change, and removed
> > the match_lines assumption. Revised patch attached; please ignore the
> > patch of a half-hour ago.
>
> Thanks. It definitely belongs in pcresearch.c.
> You're welcome to push that (or I will soon).
> I've rebased my changes on top of it and am adding tests.

I've pushed your change along with the attached.
I'll probably create another snapshot today.


grep-backslash-D.patch
Description: Binary data

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-19 Thread Jim Meyering

On Sun, Mar 19, 2023 at 1:55 AM Paul Eggert  wrote:
>
> On 2023-03-19 01:28, Paul Eggert wrote:
> > Looking at the source code again, how about if we move the PCRE-specific
> > changes from src/grep.c to src/pcresearch.c which is where it really
> > belongs, and more importantly use the bleeding-edge
> > PCRE2_EXTRA_ASCII_BSD macro if available?
> >
> > Something like the attached patch, say. This patch doesn't take your \D
> > fixes (or the above suggestions) into account.
>
> Oops, that patch assumed match_lines. Also, it covered two topics in the
> doc fix. I installed the obvious topic in the doc change, and removed
> the match_lines assumption. Revised patch attached; please ignore the
> patch of a half-hour ago.

Thanks. It definitely belongs in pcresearch.c.
You're welcome to push that (or I will soon).
I've rebased my changes on top of it and am adding tests.

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-19 Thread Jim Meyering

On Sat, Mar 18, 2023 at 10:54 PM Jim Meyering  wrote:
> On Sat, Mar 18, 2023 at 5:39 PM Paul Eggert  wrote:
> > Thanks for looking into this. A couple of questions.
> >
> > First, some documentation issues. Why is PCRE2 incompatible with Perl on
> > this issue? Are there other areas where the two are incompatible?
>
> To be honest, I was not too concerned about keeping up with Perl
> and am not worried about divergence, but admit I do not like the
> implication, given the name of the option: --perl-regexp. It's always
> been "pcre-regexp" in spirit. I suppose we'll want to document that,
> eventually.
>
> > Are
> > these incompatibilities documented anywhere? Is the goal for 'grep -P'
> > to be compatible with Perl, not with PCRE2?
>
> Doesn't Perl have the same issue?
> That's why the /a and /aa match modifiers were added.
>
> > Second, although that patch focuses on \d, doesn't \D have a similar
> > problem and shouldn't it be fixed too?
>
> Good point about \D. Will adjust.

Here's an additional patch to handle \D. I've only just written it, so
it's probably wrong or incomplete somewhere. I'll review it properly
and probably improve it (could certainly add more tests in this area)
tomorrow.

By the way, have you ever used \D? I think I have not.


grep-multibyte-D.patch
Description: Binary data

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-18 Thread Jim Meyering

On Sat, Mar 18, 2023 at 5:39 PM Paul Eggert  wrote:
> Thanks for looking into this. A couple of questions.
>
> First, some documentation issues. Why is PCRE2 incompatible with Perl on
> this issue? Are there other areas where the two are incompatible?

To be honest, I was not too concerned about keeping up with Perl
and am not worried about divergence, but admit I do not like the
implication, given the name of the option: --perl-regexp. It's always
been "pcre-regexp" in spirit. I suppose we'll want to document that,
eventually.

> Are
> these incompatibilities documented anywhere? Is the goal for 'grep -P'
> to be compatible with Perl, not with PCRE2?

Doesn't Perl have the same issue?
That's why the /a and /aa match modifiers were added.

> Second, although that patch focuses on \d, doesn't \D have a similar
> problem and shouldn't it be fixed too?

Good point about \D. Will adjust.

bug#62267: grep-3.9 bug: \d matches multibyte digits

2023-03-18 Thread Jim Meyering

I was not happy to discover that with grep-3.9 and -P,
\d can match multibyte digits like the Arabic ones:

  $ LC_ALL=en_US.UTF-8 grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
  ٠١٢٣٤٥٦٧٨٩

grep -P has never before done that.
Of course, in the C/POSIX locale, there is no such match:

  $ LC_ALL=C grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
  [1]

TL;DR, with the attached fix, grep preprocesses each affected regexp,
changing each eligible "\d" to "[0-9]". Consider this a short-term fix.
Longer term (subject to pcre2 releases), we may instead simply add a
"(?aD)" prefix.  If you really want to match non-ASCII digits, use \p{Nd}.

For background, see the PCRE2 documentation:

  https://www.pcre.org/current/doc/html/pcre2pattern.html
  https://www.pcre.org/current/doc/html/pcre2syntax.html

which say this:

  By default, \d, \s, and \w match only ASCII characters, even in UTF-8
  mode or in the 16-bit and 32-bit libraries. However, if locale-specific
  matching is happening, \s and \w may also match characters with code
  points in the range 128-255. If the PCRE2_UCP option is set, the behaviour
  of these escape sequences is changed to use Unicode properties and they
  match many more characters.

Per upstream pcre2-10.40-112-g6277357, (?aD) does what we want:

  PCRE2_EXTRA_ASCII_BSD: This option forces \d to match only ASCII digits,
  even  when  PCRE2_UCP is  set. It can be changed within a pattern by
  means of the (?aD) option setting.

I used pcre2grep (built from master) to demonstrate how we may eventually use 
"(?aD)" under the covers:

  $ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '(?aD)\d' <<< '٠١٢٣٤٥٦٧٨٩'
  [Exit 1]
  $ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '(?aD)^\d+$' <<< '٠١٢٣٤٥٦٧٨٩'
  ٠١٢٣٤٥٦٧٨٩

For the record, https://github.com/PCRE2Project/pcre2 currently declares
10.42 to be the latest, while there's a commit suggesting it's 10.43.
The difference is important: 10.43 has support for (?aD), while
10.42 does not.

Incidentally, you can demonstrate this in python3, too:

  $ LC_ALL=en_US.UTF-8 python3 \
-c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩'))"
  <re.Match object; span=(0, 10), match='٠١٢٣٤٥٦٧٨٩'>

Use flags=re.ASCII to get the often-desired behavior:

  $ LC_ALL=en_US.UTF-8 python3 \
 -c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩', flags=re.ASCII))"
  None
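
As a side note to the python3 demonstration above, here is a small sketch of why these characters count as digits at all. It assumes the sample string consists of the Arabic-Indic digits U+0660 through U+0669 and builds it from code points rather than pasting it; the variable names are only for illustration.

```python
import re
import unicodedata

# Assumption: the sample string is U+0660..U+0669 (Arabic-Indic digits).
arabic_digits = "".join(chr(0x0660 + i) for i in range(10))

# Ten characters, two bytes each in UTF-8.
char_count = len(arabic_digits)
byte_count = len(arabic_digits.encode("utf-8"))

# Every character is in Unicode general category Nd (decimal digit),
# which is the property that \p{Nd} selects in PCRE2.
categories = {unicodedata.category(c) for c in arabic_digits}

# Python's default (Unicode) \d accepts them; re.ASCII restricts \d to 0-9.
unicode_match = re.fullmatch(r"\d+", arabic_digits)
ascii_match = re.fullmatch(r"\d+", arabic_digits, flags=re.ASCII)

print(char_count, byte_count, categories)
print(unicode_match is not None, ascii_match is None)
```

The 10-character, 20-byte result agrees with the "ten-digit (20-byte) string" wording in the NEWS entry of the patch.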

This is cause for a new snapshot today and soon thereafter,
the release of grep-3.10.
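
To illustrate the short-term fix described earlier in this message (rewriting each eligible "\d" as "[0-9]"), here is a minimal Python sketch of that kind of preprocessing. It is not the C code from the patch: the function name expand_backslash_d and its rules (leave an escaped backslash alone, emit "0-9" rather than "[0-9]" inside a bracket expression) are assumptions made for this example, and corner cases such as \Q...\E or a leading "]" inside brackets are deliberately ignored.

```python
import re


def expand_backslash_d(pattern: str) -> str:
    """Rewrite \\d as an ASCII-only digit class (illustrative sketch only)."""
    out = []
    in_bracket = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "d":
                # Inside [...] a nested [0-9] would be wrong, so emit a range.
                out.append("0-9" if in_bracket else "[0-9]")
            else:
                # Any other escape, including "\\\\", is copied unchanged.
                out.append(c + nxt)
            i += 2
            continue
        if c == "[" and not in_bracket:
            in_bracket = True
        elif c == "]" and in_bracket:
            in_bracket = False
        out.append(c)
        i += 1
    return "".join(out)


print(expand_backslash_d(r"\d+"))
print(expand_backslash_d(r"[\d_]+"))
print(expand_backslash_d(r"\\d"))
```

With the rewritten pattern, even a Unicode-aware matcher no longer accepts non-ASCII digits, which is the effect the patch is after.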

From 0daefc8c5659e79149a650d97ca12b49ad5e6548 Mon Sep 17 00:00:00 2001
From: Jim Meyering 
Date: Sat, 18 Mar 2023 08:28:36 -0700
Subject: [PATCH] grep: -P (--perl-regexp) \d: match only ASCII digits

Prior to grep-3.9, the PCRE matcher had always treated \d just
like [0-9]. grep-3.9's fix for \w and \b mistakenly relaxed \d
to also match multibyte digits.
* src/grep.c (P_MATCHER_INDEX): Define enum.
(pcre_pattern_expand_backslash_d): New function.
(main): Call it for -P.
* NEWS (Bug fixes): Mention it.
* doc/grep.texi: Document it: with -P, \d matches only ASCII digits.
Provide a PCRE documentation URL and an example of how
to use (?s) with -z.
* tests/pcre-ascii-digits: New test.
* tests/Makefile.am (TESTS): Add that file name.
---
 NEWS                    | 10
 doc/grep.texi           | 31
 src/grep.c              | 82
 tests/Makefile.am       |  1
 tests/pcre-ascii-digits | 31
 5 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100755 tests/pcre-ascii-digits

diff --git a/NEWS b/NEWS
index 803e14b..a24cebd 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,16 @@ GNU grep NEWS-*- outline -*-

 * Noteworthy changes in release ?.? (????-??-??) [?]

+** Bug fixes
+
+  With -P, \d now matches only ASCII digits, regardless of PCRE
+  options/modes. The changes in grep-3.9 to make \b and \w work
+  properly had the undesirable side effect of making \d also match
+  e.g., the Arabic digits: ٠١٢٣٤٥٦٧٨٩.  With grep-3.9, -P '\d+'
+  would match that ten-digit (20-byte) string. Now, to match such
+  a digit, you would use \p{Nd}.
+  [bug introduced in grep 3.9]
+

 * Noteworthy changes in release 3.9 (2023-03-05) [stable]

diff --git a/doc/grep.texi b/doc/grep.texi
index 621beaf..eaad6e1 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1141,6 +1141,37 @@ combined with the @option{-z} (@option{--null-data}) option, and note that
 @samp{grep@ -P} may warn of unimplemented features.
 @xref{Other Options}.

+For documentation, refer to @url{https://www.pcre.org/}, with these caveats:
+@itemize
+@item
+@samp{\d} always matches only the ten ASCII digits, regardless of locale or
+in-regexp directives like @samp{(?aD)}.
+Use @samp{\p@{Nd@}} if you require to match non-ASCII digits.
+Once pcre2 support for @samp{(?aD)} is widespread enough,
+we expect to make that the default, so it will be overridable.
+@c Using pcre2 git commit pcre2-10.40-112-g6277357, this demonstrates how
+@c we'll pref

bug#62052: _N_GNU_nonoption_argv_flags_ is no longer supported

2023-03-09 Thread Jim Meyering

On Wed, Mar 8, 2023 at 7:39 AM Emanuele Torre  wrote:
> I have noticed the _N_GNU_nonoption_argv_flags_ (where N is the pid of
> grep) environment variable mentioned in the documentation. I tried to
> play with it, but it does not seem to work:
>
>  bash-5.1$ (declare -x _"$BASHPID"_GNU_nonoption_argv_flags_=111
>  > exec grep -e)
>  grep: option requires an argument -- 'e'
>  Usage: grep [OPTION]... PATTERNS [FILE]...
>  Try 'grep --help' for more information.
>
> I have checked gnulib's changelog and it looks like support for it has
> been removed in 2017, and before that it has not been enabled by default
> since 2001. (and, as far as I can tell, GNU grep never explicitly
> enabled it.)
>
> Furthermore, this environment variable used to be set automatically
> by bash up to version 2.0, but since version 2.01 (released in 1997)
> bash has stopped using it.
>
> I think, at this point, it would be best to not mention that environment
> variable in the documentation; it has not been used or even supported
> for a long time so it is just confusing.

Thank you for noticing and reporting that.
Done with the attached.


grep-_N_GNU_nonoption_argv_flags_.patch
Description: Binary data

bug#60697: GNU grep mishandles \b near encoding errors

2023-01-11 Thread Jim Meyering

On Mon, Jan 9, 2023 at 10:16 PM Paul Eggert  wrote:
> Here's a shell session illustrating the problem on Fedora 37, which has
> GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.
>
>$ export LC_ALL=en_US.utf8
>$ printf '\300\n' | grep '\b'
>grep: (standard input): binary file matches
>$ printf '\300\n' | grep -P '\b'
>$
>
> Plain grep finds a word boundary in the input even though the input
> contains no words (just an encoding error). 'grep -P' does the right thing.
>
> The underlying issue is in the glibc regex code so the fix should be in
> glibc / Gnulib, but I thought I'd report it here before I forgot it.

Thanks! While this would definitely be nice to fix before the release
(in the next week or so), it's enough of a corner case that I wouldn't
feel bad releasing without a fix.

For the record, this problem first arose in grep-2.19.
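
For readers wondering why the byte \300 counts as an encoding error, here is a small Python sketch. It only illustrates the input; it does not reproduce the glibc regex bug. The helper name is_valid_utf8 is made up for this example, and Python's bytes-mode \b (ASCII word characters only) is used as a stand-in for "there are no words in this input".

```python
import re


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if RAW decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = b"\300\n"  # same bytes as: printf '\300\n'

# 0xC0 can never start a valid UTF-8 sequence.
print(is_valid_utf8(data))

# With no word characters in the input, there is no word boundary to find.
print(re.search(rb"\b", data))
```

This is the same conclusion 'grep -P' reaches in the session above: no word, hence no word boundary.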

bug#60708: pcre: improve support for linking with a library without unicode

2023-01-10 Thread Jim Meyering

On Tue, Jan 10, 2023 at 3:19 AM Carlo Arenas  wrote:
> Noticed while testing the previous patch, and which resulted in tests
> being skipped for the wrong reason.

Thanks for catching that.
I'll push the following tomorrow.
It has a tiny change that moves the declaration of "unicode" down to
just before where it's set and changes its type to uint32_t.

pcre-no-unicode.diff
Description: Binary data

bug#60618: unicode characters are not identified as such for \w and \b with -P

2023-01-07 Thread Jim Meyering

On Fri, Jan 6, 2023 at 11:37 PM Jim Meyering  wrote:
> On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering  wrote:
> > On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas  wrote:
> > > Reported to PCRE[1] with mention of GNU grep being also affected.
> > >
> > > [1] https://github.com/PCRE2Project/pcre2/issues/185
> >
> > Yikes. This is a big deal.
> > Thank you for the patch and added test.

I've also added the new names to THANKS.in and pushed this:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3

bug#60618: unicode characters are not identified as such for \w and \b with -P

2023-01-06 Thread Jim Meyering

On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering  wrote:
> On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas  wrote:
> > Reported to PCRE[1] with mention of GNU grep being also affected.
> >
> > [1] https://github.com/PCRE2Project/pcre2/issues/185
>
> Yikes. This is a big deal.
> Thank you for the patch and added test.
> I made a tiny comment tweak and this test logic change that was
> required to make the new test pass with the fixed version.
>
> -grep -Po 'r\w' in > out && fail=1
> +grep -Po 'r\w' in > out || fail=1
>
> Also, make syntax-check required to change e.g.,
>
> -compare out exp || fail=1
> +compare exp out || fail=1
>
> Every bug fix needs a NEWS entry, so I added this:
>
>   With -P, some non-ASCII UTF8 characters were not recognized as
>   word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
>   given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
>   this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
>   After the fix, it prints the correct results: "rú:ú".
>
> Finally, I expanded the ChangeLog entry and gave credit where due.
>
> I'll push this tomorrow:

Must also mention Karl Pettersson in the ChangeLog:

pcre: use UCP in UTF mode

This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.
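
The before/after behaviour in the NEWS text can be imitated with Python's re module, as a rough analogy only: re.ASCII plays the role of "UTF mode without PCRE2_UCP" and the default Unicode mode plays the role of "PCRE2_UTF together with PCRE2_UCP". That mapping is an assumption for illustration; Python's re is not PCRE2.

```python
import re

word = "Per\u00fa"  # "Perú", with a precomposed u-acute


def matches(pattern: str, flags: int = 0) -> list:
    """Return every match of PATTERN in WORD, like grep -o would list them."""
    return re.findall(pattern, word, flags)


# ASCII-only notion of "word character": u-acute is not a word constituent.
before = (matches(r"r\w", re.ASCII), matches(r".\b", re.ASCII))

# Unicode-aware notion: u-acute is a word constituent.
after = (matches(r"r\w"), matches(r".\b"))

print(before)
print(after)
```

The two tuples line up with the ":r" and "rú:ú" results quoted in the NEWS entry.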

bug#60618: unicode characters are not identified as such for \w and \b with -P

2023-01-06 Thread Jim Meyering

On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas  wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185

Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.

-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1

Also, make syntax-check required to change e.g.,

-compare out exp || fail=1
+compare exp out || fail=1

Every bug fix needs a NEWS entry, so I added this:

  With -P, some non-ASCII UTF8 characters were not recognized as
  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
  After the fix, it prints the correct results: "rú:ú".

Finally, I expanded the ChangeLog entry and gave credit where due.

I'll push this tomorrow:

grep-pcre-fix.diff
Description: Binary data

bug#60038: grep 2.20 - invalid option with search pattern "-/"

2022-12-13 Thread Jim Meyering

tags 60038 notabug
thanks

On Tue, Dec 13, 2022 at 10:04 AM Daniel Schättgen
 wrote:
> When searching for a pattern that includes "-/", the pattern is interpreted 
> as option:
>
> [dsg@db01]# grep "-" example.txt
> --/--
> [dsg@db01]# grep "/" example.txt
> --/--
> [dsg@db01]# grep "-/" example.txt
> grep: invalid option -- '/'
> Usage: grep [OPTION]... PATTERN [FILE]...
> Try 'grep --help' for more information.
> [dsg@db01]# grep --version
> grep (GNU grep) 2.20

Thanks, but this is not a bug.
To search for a pattern that starts with "-", use grep's -e option, e.g.,

  grep -e -/ example.txt

Also, you're using grep-2.20, which is more than 8 years old.
The latest is grep-3.8.
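
The option-parsing rule behind this can be sketched with Python's getopt module, which follows the same short-option conventions. This is only an analogy: the option string "e:" stands in for grep's much larger option set, and the helper name parse is made up for the example.

```python
import getopt


def parse(argv):
    """Parse ARGV with a single option, -e PATTERN; return result or error."""
    try:
        return getopt.getopt(argv, "e:")
    except getopt.GetoptError as err:
        return f"error: {err.msg}"


# "-/" looks like an option named "/", which does not exist.
print(parse(["-/", "example.txt"]))

# After -e, the next word is taken as the pattern, whatever it looks like.
print(parse(["-e", "-/", "example.txt"]))

# "--" ends option processing, so "-/" becomes an ordinary argument.
print(parse(["--", "-/", "example.txt"]))
```

The third call corresponds to the equally standard spelling grep -- -/ example.txt.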

bug#57604: [ef]grep usage -> POSIXLY_CORRECT?

2022-09-16 Thread Jim Meyering

On Fri, Sep 16, 2022 at 8:12 AM Simon Josefsson  wrote:
> Jim Meyering  writes:
>
> > This would be an envvar for which we do not commit to any level of
> > support in future releases.
>
> Would the envvar be documented?  Would it be a deprecated feature, with
> a removal plan?  It seems we traded removing [ef]grep into introducing
> new unsupported features which feels a bit unsatisfying...
>
> How about saying that the envvar, together with all remaining traces of
> [ef]grep references will be removed in 2025?
>
> The point with the exercise was (at least to me) to remove complexity,
> but it seems we will have to wait some more until that can happen.

Hi Simon,
I agree that adding temporary complexity (and then
documenting/announcing it) to aid in the transition feels wrong.
I'm still not sure, but if we were to add such things, they would most
definitely come with a planned removal date, likely before 2025.

bug#57604: [ef]grep usage -> POSIXLY_CORRECT?

2022-09-16 Thread Jim Meyering

On Thu, Sep 15, 2022 at 9:28 PM Guillem Jover  wrote:
> On Fri, 2022-09-09 at 11:41:49 -0500, Paul Eggert wrote:
> > On 9/9/22 07:16, Guillem Jover wrote:
> > > There are now packages that fail to work such as
> > > apt-file (https://bugs.debian.org/1019329),
> >
> > From what I can see, that bug report doesn't say that apt-file fails to
> > work, only that apt-file issues a warning and then goes on to work.
>
> Ah, you are right, that might have coincided with a query I did that
> returned nothing then, sorry about that!
>
> > > Transitioning away from fgrep/egrep seems like it would be painful as
> > > that involves lots of upstream projects:
> >
> > I glanced at those, and didn't see any projects that will stop working, only
> > projects that will see annoying warnings. Admittedly I didn't look at all
> > the examples, but in the first page of
> >  (your first
> > citation) all the code examples should continue to work.
> >
> > Could you give examples of programs that actually stop working? That would
> > help us consider remedies.
>
> It's true that most of those instances are probably not going to fail.
> But what is definitely being affected are autopkgtests from Debian
> packages for example. By default those consider any output to stderr
> a signal to mark the test as failing. So the new grep failures are
> causing unrelated tests to fail now.
>
> Some are going to be hard to fix locally, or quickly everywhere, for
> example the one in libtool, as until it is fixed, relibtoolizing will
> have no effect, and afterwards that will get fixed only as long as
> the packaging always forcibly relibtoolizes (or autoreconfigures).
> 
>
>
> As I've mentioned earlier, personally, I definitely want to be able to
> see those kinds of warnings so that I can fix or change stuff I
> maintain, or report bugs with patches. But unfortunately it seems this
> is causing enough disruption that all the new warnings might end up
> being disabled in Debian. I think it's been discussed earlier that
> environment variables are not desired? But I think it would still be nice
> to be able to control those warnings globally/externally, so that even
> if say a project like Debian ends up disabling them, people can still
> enable them to be able to diagnose and track those down.

Thanks for the feedback.

We may make a new release with two additions:
- an envvar to control [ef]grep warnings, enabled by default
- a configure-time option to make it disabled by default

This would be an envvar for which we do not commit to any level of
support in future releases.

bug#57604: [ef]grep usage -> POSIXLY_CORRECT?

2022-09-11 Thread Jim Meyering

On Thu, Sep 8, 2022 at 4:01 PM Karl Berry  wrote:
> Hi Jim,
>
> Some must care about portability,
>
> Certainly agreed. Even I do, sometimes :). But that does not mean
> everyone needs to, in every situation.  As I said, I fail to understand
> the benefit of making the warning unconditional.
>
> So far as I can see, it's also against GNU principles, as I wrote,
> though evidently you don't agree.
>
> and these warnings help them do a better job.
>
> When people want extreme POSIX compliance, they should set
> POSIXLY_CORRECT. That's what it's there for, and that's when I think the
> warnings should be issued, as I said at the beginning.
>
> But since Paul rejected that, ok, a different variable that lets us turn
> them off (GREPWARNINGS=efgrepok or whatever) would at least provide some
> palliation. I don't understand why you two are opposed to this simple
> remediation.
>
> As Gary mentioned above, it's easy to disable them.
>
> Obviously it is trivial to edit the scripts or have a different version
> in PATH for my own machine(s).  But those are no substitute for having a
> supported way to use the distributed [ef]grep without warnings.
>
> I would argue that it is even more important to retain these
> stray-backslash warnings, because they tend to highlight real bugs.
>
> "tend" being the key word there. But anyway, I see your point, and won't
> argue that one further, since the efgrep warnings are what's causing me
> the agony. -k

Hi Karl,

It would help if you could point to some malfunction.

Consider the alternative.

Should we release a new version of grep that provides a documented way
(say a configure-time option) to disable a warning about a
long-deprecated feature so you don't have to manually tweak the
four-line fgrep and egrep scripts? AFAIK, these new warnings cause no
malfunction.

Wouldn't it be better to fix the roots of the problem rather than
piling another kludge on top to disable the annoying warnings? Think
about the next steps: when more and more distros cease to distribute
the egrep and fgrep crutches, what will people do? Eventually, we'll
all break the habit, at least in scripts. If you want to use it in
personal scripts or on the command line, create your wrapper script or
alias/function.

bug#57604: [ef]grep usage -> POSIXLY_CORRECT?

2022-09-08 Thread Jim Meyering

Hi Karl,

Sorry to cause you grief, but...

On Wed, Sep 7, 2022 at 7:49 PM Karl Berry  wrote:
>
> [ef]grep
>
> I guess my basic issue is that I don't understand the benefit of the new
> warning.  It causes a lot of trouble.  What is the countervailing
> positive benefit?

Some must care about portability, and these warnings help them do a better job.
As Gary mentioned above, it's easy to disable them.

> $ grep '\Q' /dev/null
> grep: warning: stray \ before Q

> It would be nice to be able to turn those off too. (It hit me today.)

I would argue that it is even more important to retain these
stray-backslash warnings, because they tend to highlight real bugs.
Consider these uses of \d:

  $ echo d | grep-3.7 '\d'
  d
  $ echo d | grep-3.8 '\d'
  grep: warning: stray \ before d

Anyone used to PCRE regexps (who isn't, these days?) knows that its
"\d" is intended to match a digit, not the letter "d". With grep-3.7,
you'd get misbehavior and no warning about your error. With grep-3.8,
you'll get the diagnostic and maybe switch to using "grep -P", where
"\d" works as expected -- switching from \d to [0-9] hurts readability
and feels like dumbing-down, especially when there are two or more \d
uses. Using PCRE's \Q...\E groups *without -P* is another issue that
is now diagnosed.

For example, the following upstream projects have misuses of grep that
are exposed by running this:

  git grep 'grep .*\\[dQE]' | grep -ve '-[[:alnum:]]*P'

- linux
scripts/checkpatch.pl:  `grep -Eq "\\"\\^\Q$vendor\E,\\.\\*\\":" $vp_file`;

- gcc
libgo/go/cmd/go/testdata/script/mod_get_lazy_indirect.txt:grep 'rsc.io/quote v\d+\.\d+\.\d+ // indirect$' go.mod
libgo/go/cmd/go/testdata/script/mod_get_lazy_indirect.txt:! grep 'rsc.io/quote v\d+\.\d+\.\d+$' go.mod
libgo/go/cmd/go/testdata/script/mod_get_lazy_indirect.txt:grep 'rsc.io/quote v\d+\.\d+\.\d+$' go.mod
libgo/go/cmd/go/testdata/script/mod_get_lazy_indirect.txt:! grep 'rsc.io/quote v\d+\.\d+\.\d+ // indirect$' go.mod
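
For trees that are not git repositories, the same two-stage filter can be approximated in Python. This is a sketch, not a replacement for the command above: the function name suspicious_grep_uses is made up, and [0-9A-Za-z] is used for [[:alnum:]], which assumes the C locale.

```python
import re

# Stage 1: a grep invocation whose arguments contain \d, \Q or \E.
USES_PCRE_ESCAPE = re.compile(r"grep .*\\[dQE]")
# Stage 2 (inverted): an option cluster that includes P, such as -P or -oP.
HAS_P_OPTION = re.compile(r"-[0-9A-Za-z]*P")


def suspicious_grep_uses(lines):
    """Return lines that use PCRE-only escapes with grep but without -P."""
    return [
        line
        for line in lines
        if USES_PCRE_ESCAPE.search(line) and not HAS_P_OPTION.search(line)
    ]


sample = [
    r"grep 'rsc.io/quote v\d+\.\d+\.\d+$' go.mod",
    r"grep -P '\d+' file",
    r"grep -oP '\d' file",
    "grep foo bar",
]
for line in suspicious_grep_uses(sample):
    print(line)
```

Like the original pipeline, this is a heuristic: it will miss invocations split across lines and may flag ones where -P is supplied indirectly.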

bug#56888: 'echo message | grep []' is affected by files in local directory when using bracket

2022-08-03 Thread Jim Meyering

tags 56888 + notabug
close 56888
stop

On Tue, Aug 2, 2022 at 9:08 AM Carlo Arenas  wrote:
>
> This behaviour is expected and described in the manual (albeit it
> might be a good candidate for a FAQ) :
>
>   https://www.gnu.org/software/grep/manual/grep.html#Usage
>
> Even before grep gets to see the expression, the shell would try to
> match it and expand it as needed, which is obviously not what you want
> for your use case and why it would be better if you quote it.
>
>   time echo "axyz" | grep '[abcd]xyz'
>
> should behave as you expect, regardless of what the current directory has.

Thanks. By the above, I've marked this as "not a bug" and closed the
auto-generated ticket.

bug#56697: v. 3.7: Typo in man page

2022-07-22 Thread Jim Meyering

On Fri, Jul 22, 2022 at 3:02 AM Andrea Greselin
 wrote:
> Hi, there's a minor bug in grep's man page (at least in Fedora 36's version
> of GNU grep 3.7). In the description of the flag '-P' there's written
> Interpret I<PATTERNS> as…
> with the markup instruction visible, instead of "PATTERNS" being in italics
> or underlined.

Thanks for the report.
That was fixed in January via
https://git.sv.gnu.org/gitweb/?p=grep.git;a=commitdiff;h=v3.7-47-gf31ae6d

bug#55641: Using colours with grep

2022-05-29 Thread Jim Meyering

On Sat, May 28, 2022 at 5:18 PM Paul Eggert  wrote:
> On 5/28/22 11:19, goncholden wrote:
> > I agree on removing GREP_COLOR entirely.
>
> Sounds good to me too. Proposed patch attached. I haven't installed
> this, as I'd like Jim's opinion (we're reasonably close to a release I
> think).

Thanks for writing that.

Yes, I'm close to making a release, indeed, but I do like this change
and it is only a sometimes-triggered warning, so please go ahead.
Adding a test for the new behavior would be nice.

bug#39678: POSIXLY_CORRECT removal, and oddball regex doc

2022-05-21 Thread Jim Meyering

On Sat, May 21, 2022 at 3:05 AM Paul Eggert  wrote:
> Looking again at grep bug 39678  I noticed
> that the bug occurs even when grep is not coloring:
>
> echo a | grep -oi --color=never '\a'
>
> This outputs nothing and exits with status 0, which is clearly wrong.

Wrong, indeed!

> I tracked this down to a bug buried deep in the bowels of glibc regex, a
> bug that Tomasz also spotted. It's not trivial to fix (the fix that
> Tomasz sent in doesn't feel right, at least for \X where X is a
> multibyte letter) and any fix would be low priority since the bug occurs
> only in regular expressions like '\a' that have unspecified behavior -
> which means the behavior though wrong nevertheless conforms to POSIX.
>
> I'm inclined to address this by having GNU 'grep' diagnose unspecified
> regexps like '\a' and exit with status 2, much as it already diagnoses
> unspecified regexps like '[:alpha:]'. If this approach sounds too
> drastic, a gentler approach would be for 'grep' to warn about '\a'
> without changing the exit status for now, and escalate the warning to
> exit with status 2 in a later 'grep' release.

In my experience, there are many lurking uses of things like '\a', and
I would like to ease into this gently, so I much prefer your latter
approach: warn now, and change grep's exit status later -- at least 6
months to a year later, to give people at least a chance to fix their
offending uses before they break.

Thanks for all the improvements!

bug#54043: Simple regexp bug [contains spoiler for today's wordle]

2022-02-17 Thread Jim Meyering

On Thu, Feb 17, 2022 at 7:46 AM Matthew Wilcox  wrote:
> I noticed this one while doing:
>
> $ grep sha[^s]e five-letter-words
> share
>
> which doesn't fit with:
>
> $ grep sha.e five-letter-words
> shade
> shake
> shale
> shame
> shape
> share
> shave
>
> A reproducer is easy:
>
> $ echo shame |grep sha[^s]e
> (no output)

This is not a bug in grep. Your failure to quote the regular expression
means that the argument is first interpreted by the shell.
To demonstrate the argument that "grep" ends up using,
run this from that same directory:

  echo sha[^s]e

If I have something named e.g., "shape" in the current directory, that
would print "shape". If I have two matching names, e.g., shave and shale,
it will print both names.

IMHO, it is almost always best to single-quote regular expressions like that.
Quoting your reproducer, you see it works as desired:

  $ echo shame |grep 'sha[^s]e'
  shame
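
What the shell does with the unquoted argument can be modelled in a few lines of Python. This is a simplified sketch of pathname expansion, not a shell implementation: the function name shell_expand is made up, the directory contents are hypothetical, and "[^" is translated to "[!" because Python's fnmatch only understands the "!" spelling of negation.

```python
import fnmatch


def shell_expand(word, names):
    """Expand WORD against NAMES the way an unquoted glob would be expanded.

    If nothing matches, the word is passed through unchanged, which is the
    default behaviour of POSIX shells.
    """
    pattern = word.replace("[^", "[!")
    found = sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))
    return found if found else [word]


# Hypothetical directory that happens to contain a file named "share".
print(shell_expand("sha[^s]e", ["five-letter-words", "share"]))

# Two matching names: grep would get one as the pattern, the other as a file.
print(shell_expand("sha[^s]e", ["shale", "shave"]))

# No matching name: the regexp reaches grep untouched.
print(shell_expand("sha[^s]e", ["five-letter-words"]))
```

In the first case grep is effectively run as grep share five-letter-words, which would explain seeing only "share" in the output.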

bug#54006: Duplicate test

2022-02-15 Thread Jim Meyering

On Mon, Feb 14, 2022 at 8:42 PM Ulrich Eckhardt
 wrote:
> there's a duplicate test in tests/empty, see attached patch.
>
> I'm working on a few other improvements, so please provide feedback if
> there's anything I should do differently in the future.

Thank you. I've applied that, edited the commit log to make it
conform, and pushed.
See the file HACKING for policy on how to write Emacs-style ChangeLog
entries for grep commit logs. E.g., title is not a sentence and
usually has no punctuation. And each affected file name is called out.
Sometimes function names are also listed, e.g.,

* dir/file_name (func1, func2): Description.

bug#52679: Errors in grep man pages

2021-12-20 Thread Jim Meyering

On Mon, Dec 20, 2021 at 8:23 AM Helge Kreutzmann  wrote:
>
> Dear grep maintainer,
> the manpage-l10n project maintains a large number of translations of
> man pages both from a large variety of sources (including grep) as
> well for a large variety of target languages.
>
> During their work translators notice different possible issues in the
> original (English) man pages. Sometimes this is a straightforward
> typo, sometimes a hard to read sentence, sometimes this is a
> convention not held up and sometimes we simply do not understand the
> original.
>
> We use several distributions as sources and update regularly (at
> least every 2 months). This means we are fairly recent (some
> distributions like archlinux also update frequently) but might miss
> the latest upstream version once in a while, so the error might be
> already fixed. We apologize and ask you to close the issue immediately
> if this should be the case, but given the huge volume of projects and
> the very limited number of volunteers we are not able to double check
> each and every issue.
>
> Secondly we translators see the manpages in the neutral po format,
> i.e. converted and harmonized, but not the original source (be it man,
> groff, xml or other). So we cannot provide a true patch (where
> possible), but only an approximation which you need to convert into
> your source format.
>
> Finally the issues I'm reporting have accumulated over time and are
> not always discovered by me, so sometimes my description of the
> problem may be a bit limited - do not hesitate to ask so we can clarify
> them.
>
> I'm now reporting the errors for your project. If future reports
> should use another channel, please let me know.
>
> Man page: grep.1
> Issue: IE<lt>PATTERNSE<gt> → I<PATTERNS>
>
> "Interpret IE<lt>PATTERNSE<gt> as Perl-compatible regular expressions "
> "(PCREs).  This option is experimental when combined with the B<-z> (B<-\\^-"
> "null-data>)  option, and B<grep -P> may warn of unimplemented features."
> --
> Issue: BE<lt>pcrepatternE<gt>(3) → B<pcrepattern>(3)
>
> "B<grep> understands three different versions of regular expression syntax: "
> "``basic'' (BRE), ``extended'' (ERE) and ``perl'' (PCRE).  In GNU B<grep> "
> "there is no difference in available functionality between basic and extended "
> "syntaxes.  In other implementations, basic regular expressions are less "
> "powerful.  The following description applies to extended regular "
> "expressions; differences for basic regular expressions are summarized "
> "afterwards.  Perl-compatible regular expressions give additional "
> "functionality, and are documented in BE<lt>pcresyntaxE<gt>(3) and "
> "BE<lt>pcrepatternE<gt>(3), but work only if PCRE support is enabled."

Thank you for the reports.
The above issues are no longer present in the latest sources.

> --
> Issue: Not in "grep --help"; remove?
>
> "B<-y>"

This is the desired state: documented in texinfo only (not in --help
or man page), because the option is on its way to being deleted.

> --
> Issue: --invert-match is above, not below.
>
> "Suppress normal output; instead print a count of matching lines for each "
> "input file.  With the B<-v>, B<-\\^-invert-match> option (see below), count "
> "non-matching lines."

Thanks for that. I have corrected it with a patch in your name here:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=95440891d04762d8112b6ae858b9b00932b573d5

bug#47264: [PATCH] pcre: migrate to pcre2

2021-11-07 Thread Jim Meyering

Thanks Carlos for working on that and Paul for the speedy feedback!
I won't be able to spend time on this for the next couple of weeks.

bug#51231: disregard patch

2021-10-16 Thread Jim Meyering

tags 51231 notabug
stop

On Sat, Oct 16, 2021 at 12:29 AM Carlo Arenas  wrote:
> And of course it has side effects (as shown by the test suite), and
> would only help (if fixed) when the needle is a fixed string, which is
> 3x slower than doing -F, -G or -E.
>
> Apologies for the distraction.

Marking this as "notabug" via first lines above and (via the "-done"
in recipient of 51231-d...@debbugs.gnu.org) closing the issue.

bug#46227:

2021-10-14 Thread Jim Meyering

On Thu, Oct 14, 2021 at 12:03 AM Sam James  wrote:
> I'm sorry for missing your earlier question -- yes, it's working great in 3.7,
> and I really appreciate the help from you both.

Thanks for confirming it's resolved.
Closing this ticket.

bug#50093: djb2 correction

2021-08-18 Thread Jim Meyering

On Tue, Aug 17, 2021 at 11:04 PM Paul Eggert  wrote:
> On 8/17/21 3:32 AM, Jim Meyering wrote:
> > -  size_t h = 0;
> > +  size_t h = 5381;
>
> I expect DJB chose that number because of the primeth recurrence
> sequence <https://oeis.org/A007097>:
>
> 2 is 1st prime.
> 3 is 2nd prime.
> 5 is 3rd prime.
> 11 is 5th prime.
> 31 is 11th prime.
> 127 is 31st prime.
> 709 is 127th prime.
> 5381 is 709th prime.
> 52711 is 5381st prime.
> ...
>
> Although 5381 is the largest number in this sequence that can fit into
> 'int' in a portable C program, and that's probably why DJB chose 5381,
> we're not limited to such small values here.
>
> How about the attached patch instead?

I prefer that, indeed. Thanks.
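
The sequence Paul lists is easy to check. Here is a minimal Python sketch that regenerates it with a sieve; the function names and the sieve limit of 60000 are choices made for this example, and 32767 is used as the largest value a portable C 'int' is guaranteed to hold.

```python
def primes_up_to(limit):
    """Return all primes <= LIMIT using a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\0\0"
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Clear every multiple of n starting at n*n.
            sieve[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
    return [n for n, flag in enumerate(sieve) if flag]


def primeth_sequence(count, limit=60000):
    """Return COUNT terms of: start at 1, then repeatedly take the k-th prime."""
    primes = primes_up_to(limit)
    terms = []
    index = 1
    for _ in range(count):
        index = primes[index - 1]  # the index-th prime, 1-based
        terms.append(index)
    return terms


sequence = primeth_sequence(9)
print(sequence)
print(max(t for t in sequence if t <= 32767))
```

The last line prints the largest term that fits in a 16-bit 'int', which is the property Paul suggests motivated the choice of 5381.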

bug#50093: djb2 correction

2021-08-17 Thread Jim Meyering

Alex Murray noticed that my djb2 implementation mistakenly initialized
to 0, rather than to 5381. Corrected with this:

From 54590ca833dba62041af045e7bc7c09b90b54b71 Mon Sep 17 00:00:00 2001
From: Alex Murray 
Date: Tue, 17 Aug 2021 03:24:37 -0700
Subject: [PATCH] grep: correct DJB2 initialization

* src/grep.c (hash_pattern): DJB2 starts with 5381, not 0.
---
 src/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/grep.c b/src/grep.c
index 271b6b9..8fed550 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -126,7 +126,7 @@ static Hash_table *pattern_table;
 static size_t _GL_ATTRIBUTE_PURE
 hash_pattern (void const *pat, size_t n_buckets)
 {
-  size_t h = 0;
+  size_t h = 5381;
   intptr_t pat_offset = (intptr_t) pat - 1;
   unsigned char const *s = (unsigned char const *) pattern_array + pat_offset;
   for ( ; *s != '\n'; s++)
--
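
For reference, here is a minimal Python sketch of the textbook djb2 hash with the corrected starting value. The body of grep's loop is not shown in the diff above, so the "h * 33 + byte" update used here is the classic formulation and an assumption, not a transcription of hash_pattern; the 64-bit mask likewise assumes a 64-bit size_t.

```python
MASK = (1 << 64) - 1  # assumption: emulate a 64-bit size_t


def djb2(data: bytes, initial: int = 5381) -> int:
    """Classic djb2: start at INITIAL, then h = h * 33 + byte for each byte."""
    h = initial
    for byte in data:
        h = (h * 33 + byte) & MASK
    return h


print(djb2(b""))
print(djb2(b"a"))

# One observable difference between the two starting values: with 0, leading
# NUL bytes leave the state at 0 and so vanish from the hash; with 5381 they
# do not.
print(djb2(b"\0a", initial=0) == djb2(b"a", initial=0))
print(djb2(b"\0a") == djb2(b"a"))
```

The caller would still reduce the result modulo the number of buckets, as the n_buckets parameter in the C signature suggests.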

bug#49996: egrep and fgrep?

2021-08-16 Thread Jim Meyering

On Sun, Aug 15, 2021 at 8:11 PM Paul Eggert  wrote:
>
> On 8/15/21 7:25 AM, arn...@skeeve.com wrote:
> > I'm willing to bet that the majority of grep/egrep/fgrep invocations
> > come from the command line rather than from scripts.
>
> That's not true for me, as I almost always invoke 'grep' via a script or
> shell function or Emacs. And I'm skeptical that it's true worldwide.
>
> I doubt whether we should remove egrep and fgrep immediately, as that'd
> be a bit sudden. But it's a good time to take the next step. Proposed
> patch attached.

Thanks for writing that, Paul. The patch looks fine.
The irony... that one of our own tests used fgrep!
One nit in the commit log:
> * src/egrep.sh: Issue a obsolescence warning.

s/a/an/

bug#46227: Test failure on SPARC (stack-overflow)

2021-08-14 Thread Jim Meyering

On Tue, Aug 10, 2021 at 6:53 AM Paul Eggert  wrote:
> The stack-overflow bug you reported appears
> to be fixed on SPARC Solaris 10. Could you please try again on SPARC
> Gentoo? You can use the new grep snapshot, announced here:
>
> https://lists.gnu.org/r/grep-devel/2021-08/msg3.html
>
> Thanks.

Thanks for the report.
It sounds like this is resolved, so I will not delay the release for it.
Please let us know if it is really resolved.

bug#49983: grep-3.6.27-20b4 tests fail on Cygwin

2021-08-14 Thread Jim Meyering

On Wed, Aug 11, 2021 at 9:36 AM Paul Eggert  wrote:
>
> On 8/11/21 12:20 AM, Gary Johnson wrote:
> > The
> > main thing I'm lacking is time.
>
> I *quite* understand.
>
> Unfortunately I don't use Cygwin so can't help much on Cygwin-specific
> stuff.

Thanks for reporting and investigating.

While understanding/fixing this surrogate-pair test failure on Cygwin
would be nice, it is niche enough that it need not hold up the
release.

bug#49996: egrep and fgrep?

2021-08-14 Thread Jim Meyering

On Wed, Aug 11, 2021 at 9:35 AM Paul Eggert  wrote:
> On 8/11/21 12:22 AM, Simon Josefsson via Bug reports for GNU grep wrote:
> > I think the main reason for deprecating
> > them was that POSIX dropped a requirement for them?
>
> As I recall, it was because they were kinda useless cruft. Portable
> scripts can't use egrep and fgrep since they're not standardized, and
> for personal command-line usage aliases suffice and 'eg' is a better
> alias anyway (it's less typing).

IMHO, they must be removed.
Anyone who needs to use "egrep" or "fgrep" from the
command line can use a function or alias. Given their lack of
standardization, those should not be used in scripts.

My only questions are "when?" and "how?". I.e., should we first release
intermediate scripts that emit a warning every time they are used, or
just drop them from the list of installed targets?

I won't do either now, but they've been deprecated for so long I'll
definitely consider it for the next release.

bug#47649: grep bug report - improper handling of file symlinks with -r option

2021-08-07 Thread Jim Meyering

On Sat, Aug 7, 2021 at 4:43 AM Chris Drake  wrote:
> Looks like the -r and -R got mixed up?
>
> What is now the "-R" behaves the same way as the original "-r", and now the 
> new "-r" behaves differently.
>
> For backwards-compatibility, the original behaviour should have been 
> preserved, and the new feature assigned to a new switch ?

Hi Chris,

There was no mix-up, and this is not new.
This was a deliberate decision that dates back to 2012 (first release
with it was 2.12).
Sorry this causes you difficulty, but at least one other grep
implementation has -r and -R options that work this way:

https://www.freebsd.org/cgi/man.cgi?query=grep

bug#45849: Autoconf options and dependencies

2021-08-06 Thread Jim Meyering

On Thu, Jan 14, 2021 at 8:36 AM Jim Meyering  wrote:
> On Thu, Jan 14, 2021 at 6:29 AM Jeffrey Walton  wrote:
> > I noticed Grep offers these two autoconf options (and friends):
> >
> > --with-libiconv-prefix
> > --with-libintl-prefix
> >
> > Configure does not complain when they are used.
> >
> > However, when I check link dependencies with ldd:
> >
> > $ ldd /usr/local/bin/grep
> > linux-vdso.so.1
> > libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3
> > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
> > /lib64/ld-linux-x86-64.so.2
> >
> > There seems to be no iConv dependencies, but there does seem to be
> > PCRE dependencies. (It may be time to bump to PCRE2 since PCRE is
> > end-of-life and the sources have some undefined behavior present).
> >
> > If I use --with-libpcre-prefix, then I get an autoconf warning:
> >
> > WARNING: unrecognized options: --with-libpcre-prefix
> >
> > It looks like something is a bit sideways.
>
> Hi Jeffrey,
>
> Thanks for investigating. I haven't looked at the
> --with-libiconv-prefix or --with-libintl-prefix issues, but PCRE2
> (overdue, indeed) is on our radar. I expect that the switch to it will
> happen soon. That may serve as impetus for the next release.

Actually, PCRE2 is no longer an immediate priority.
The cost/value ratio seems too high for now, so unless someone else
does the work it's unlikely to change.

bug#47834: grep: Document --group-separator/--no-group-separator

2021-08-06 Thread Jim Meyering

On Fri, Apr 16, 2021 at 5:15 PM Kevin Locke  wrote:
> It would be great if the grep.1 man page and --help usage information
> included the --group-separator and --no-group-separator options (which
> are already documented in grep.texi).  I've attached patches to do that.

Thank you. I've applied those with one tiny change:
I added an additional space between the option spec and its
description to placate "make syntax-check".

bug#47649: grep bug report - improper handling of file symlinks with -r option

2021-08-06 Thread Jim Meyering

tags 47649 notabug
close 47649
done

On Wed, Apr 7, 2021 at 7:21 PM Chris Drake  wrote:
> *This is the original working grep behaviour - it found text inside files
> that were symlinks:-*
> (Note the output: "*./folder/testfile1:this is my test file*")
>
> [root@ir2 ~]# grep --version
> grep (GNU grep) 2.5.1
>
> Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> [root@ir2 ~]# mkdir bugdemo
> [root@ir2 ~]# cd bugdemo
> [root@ir2 bugdemo]# echo "this is my test file" > testfile1
> [root@ir2 bugdemo]# echo "this is my test file" > testfile2
> [root@ir2 bugdemo]# mkdir folder
> [root@ir2 bugdemo]# cd folder
> [root@ir2 folder]# echo "this is my test file" > testfile3
> [root@ir2 folder]# ln -s ../testfile1
> [root@ir2 folder]# cd ..
> [root@ir2 bugdemo]# unalias grep
> [root@ir2 bugdemo]# grep -r test .
> ./testfile2:this is my test file
> ./folder/testfile1:this is my test file
> ./folder/testfile3:this is my test file
> ./testfile1:this is my test file
> [root@ir2 bugdemo]#
...
> It looks like processing exists to not recursively follow symlinks, and
> someone has messed with that which has caused files to no longer be
> searched by mistake.

Thanks for the report, but that is the documented behavior of -r.
You appear to prefer -R:

‘-r’
‘--recursive’
 For each directory operand, read and process all files in that
 directory, recursively.  Follow symbolic links on the command line,
 but skip symlinks that are encountered recursively.  Note that if
 no file operand is given, grep searches the working directory.
 This is the same as the ‘--directories=recurse’ option.

‘-R’
‘--dereference-recursive’
 For each directory operand, read and process all files in that
 directory, recursively, following all symbolic links.

bug#46179: Tweak to man page

2021-01-31 Thread Jim Meyering

On Fri, Jan 29, 2021 at 11:51 AM Robert Bruntz  wrote:
>   I would like to recommend a minor tweak to the man page for GNU grep.
> (The version I'm looking at is 2.20, but I doubt that matters.)
>   I would recommend changing the description of the -l and -L options from
> this:
> The scanning will stop on the first match.
>   to something like this:
> The scanning of a file will stop on the first match.
>   The reason for this is that the first version is ambiguous, in that it
> could be read as the grep command itself will stop at the first match, thus
> printing only the name of the first file that matches (-l) or doesn't match
> (-L), rather than the scanning of each file will stop on the first match
> and start again on the next file.

Thanks for the report. That has highlighted the fact that the sentence
in question doesn't even make sense for -L, so I've deleted it. Note
that for the -l option, this was documented properly in grep.texi (the
primary documentation -- you can read via "info grep"), but I've
tweaked the wording there slightly and propagated that wording to the
man page.

I'll push the attached later today.

0001-doc-man-fix-L-description-and-improve-l-s.patch
Description: Binary data

bug#45849: Autoconf options and dependencies

2021-01-14 Thread Jim Meyering

On Thu, Jan 14, 2021 at 6:29 AM Jeffrey Walton  wrote:
> I noticed Grep offers these two autoconf options (and friends):
>
> --with-libiconv-prefix
> --with-libintl-prefix
>
> Configure does not complain when they are used.
>
> However, when I check link dependencies with ldd:
>
> $ ldd /usr/local/bin/grep
> linux-vdso.so.1
> libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3
> libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
> /lib64/ld-linux-x86-64.so.2
>
> There seems to be no iConv dependencies, but there does seem to be
> PCRE dependencies. (It may be time to bump to PCRE2 since PCRE is
> end-of-life and the sources have some undefined behavior present).
>
> If I use --with-libpcre-prefix, then I get an autoconf warning:
>
> WARNING: unrecognized options: --with-libpcre-prefix
>
> It looks like something is a bit sideways.

Hi Jeffrey,

Thanks for investigating. I haven't looked at the
--with-libiconv-prefix or --with-libintl-prefix issues, but PCRE2
(overdue, indeed) is on our radar. I expect that the switch to it will
happen soon. That may serve as impetus for the next release.

bug#45432: Use both --include and --exclude at the same time

2021-01-04 Thread Jim Meyering

tags 45432 moreinfo
stop

On Fri, Dec 25, 2020 at 8:57 AM Fred .Flintstone  wrote:
> It seems --exclude does nothing when --include is used. It would be useful
> to be able to use both together, in order to do things such as recursively
> grepping files of a certain file extension while excluding certain
> directories.
>
> Example:
> $ grep --recursive --include="*.cs" --exclude="*/tests/*"

Can you provide a complete example showing a malfunction?
You've probably already read this from "info grep", but see also
the description of --exclude there:

‘--include=GLOB’
 Search only files whose name matches GLOB, using wildcard matching
 as described under ‘--exclude’.  If contradictory ‘--include’ and
 ‘--exclude’ options are given, the last matching one wins.  If no
 ‘--include’ or ‘--exclude’ options match, a file is included unless
 the first such option is ‘--include’.
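
The precedence rule quoted above can be read as a small decision procedure. The following is a simplified sketch of that rule only, assuming each option is a single glob matched with POSIX fnmatch against one name; the struct name_rule type and the name_selected function are hypothetical, and grep's own implementation (including how it matches directory components) may differ.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* One --include=GLOB or --exclude=GLOB option, kept in command-line
   order.  Hypothetical type for illustration, not grep's own.  */
struct name_rule
{
  bool include;                 /* true: --include, false: --exclude */
  char const *glob;
};

/* Return true if NAME should be searched given the N options in RULES:
   the last matching option wins; if no option matches, NAME is searched
   unless the first option is an --include.  */
bool
name_selected (char const *name, struct name_rule const *rules, size_t n)
{
  for (size_t i = n; i-- > 0; )
    if (fnmatch (rules[i].glob, name, 0) == 0)
      return rules[i].include;
  return n == 0 || !rules[0].include;
}
```

Under this reading, with --include="*.cs" followed by --exclude="test*", the name foo.cs is searched, test.cs is skipped because the later --exclude also matches, and foo.txt is skipped because nothing matches and the first option is an --include.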

bug#45353: Errors in man pages

2020-12-23 Thread Jim Meyering

On Wed, Dec 23, 2020 at 10:03 AM Helge Kreutzmann  wrote:
>
> Hello Jim,
> On Wed, Dec 23, 2020 at 09:14:20AM -0800, Jim Meyering wrote:
> > On Wed, Dec 23, 2020 at 8:46 AM Jim Meyering  wrote:
> > > On Mon, Dec 21, 2020 at 8:12 AM Helge Kreutzmann wrote:
> > > Thank you for the suggestions.
>
> You are welcome. Thanks for handling them this quickly.
>
> I'm only citing parts where necessary, not those where no question or
> disagreement is raised. Where applicable, I denote that in our
> translated files so the translators in question are aware of your
> rationales.

+1

...
>
> I'm fine if you put my name in there. If possible it would be fair if
> you could denote (in a comment or commit message) that the input came
> from several members of the manpage-l10n translation community.

Done and pushed. See this:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=91ce9cdad384cb6d774e9884707c7f00946d909d

bug#45353: Errors in man pages

2020-12-23 Thread Jim Meyering

On Wed, Dec 23, 2020 at 8:46 AM Jim Meyering  wrote:
>
> On Mon, Dec 21, 2020 at 8:12 AM Helge Kreutzmann  wrote:
> ...
>
> Thank you for the suggestions.
>
> > Man page: grep.1
> > Issue: The option was mentioned above!
> >
> > "Suppress normal output; instead print a count of matching lines for each "
> > "input file.  With the B<-v>, B<-\\^-invert-match> option (see below), count "
> > "non-matching lines."
>
> That is expected. Its primary description is above.
> This part tells how it works when combined with the -c option.
>
> > --
> > Man page: grep.1
> > Issue: Reorder text?
> >
> > "Report Unix-style byte offsets.  This switch causes B<grep> to report byte "
> > "offsets as if the file were a Unix-style text file, i.e., with CR characters "
> > "stripped off.  This will produce results identical to running B<grep> on a "
> > "Unix machine.  This option has no effect unless B<-b> option is also used; "
> > "it has no effect on platforms other than MS-DOS and MS-Windows."
>
> As mentioned below, this entire option (-u) is about to disappear.
>
> > --
> > Man page: grep.1
> > Issue 1: pcresyntax(3) → B<pcresyntax>(3)
> > Issue 2: pcrepattern(3) → B<pcrepattern>(3)
> >
> > "B<grep> understands three different versions of regular expression syntax: "
> > "``basic'' (BRE), ``extended'' (ERE) and ``perl'' (PCRE).  In GNU B<grep> "
> > "there is no difference in available functionality between basic and extended "
> > "syntaxes.  In other implementations, basic regular expressions are less "
> > "powerful.  The following description applies to extended regular "
> > "expressions; differences for basic regular expressions are summarized "
> > "afterwards.  Perl-compatible regular expressions give additional "
> > "functionality, and are documented in pcresyntax(3) and pcrepattern(3), but "
> > "work only if PCRE is available in the system."
>
> Done.
>
> > --
> > Man page: grep.1
> > Issue: Order of entries not according to man-pages(7)
> >
> > "B(1), B(1), B(1), B(1), B(1), B(1), "
> > "B(1), B(1), B(2), B(3), B(3), "
> > "B(3), B(5), B(7), B(7)."
>
> I read man-pages(7)'s section on SEE ALSO and found only one nit here.
> It terminated the list with a period, while man-pages(7) says to
> provide no period.
>
> > The list should be ordered by section number and then alphabetically by
> > name.  Do not terminate this list with a period.
>
> If there's something else, please be more precise.
>
> Actually, for each suggestion in the future, please provide an actual
> diff. That would be far better.
>
> > --
> > Man page: grep.1
> > Issue: PATTERNS → I<PATTERNS>
> >
> > "Interpret PATTERNS as Perl-compatible regular expressions (PCREs).  This "
> > "option is experimental when combined with the B<-z> (B<-\\^-null-data>)  "
> > "option, and B<grep> may warn of unimplemented features."

Oops. Nearly missed this.
I've done this, too.
New patch attached.

...
>
> I wrote the attached patch in your name.
> Let me know if there's anything else to be done here.


0001-doc-adjust-man-page-syntax.patch
Description: Binary data

bug#45353: Errors in man pages

2020-12-23 Thread Jim Meyering

On Mon, Dec 21, 2020 at 8:12 AM Helge Kreutzmann  wrote:
...

Thank you for the suggestions.

> Man page: grep.1
> Issue: The option was mentioned above!
>
> "Suppress normal output; instead print a count of matching lines for each "
> "input file.  With the B<-v>, B<-\\^-invert-match> option (see below), count "
> "non-matching lines."

That is expected. Its primary description is above.
This part tells how it works when combined with the -c option.

> --
> Man page: grep.1
> Issue: Reorder text?
>
> "Report Unix-style byte offsets.  This switch causes B<grep> to report byte "
> "offsets as if the file were a Unix-style text file, i.e., with CR characters "
> "stripped off.  This will produce results identical to running B<grep> on a "
> "Unix machine.  This option has no effect unless B<-b> option is also used; "
> "it has no effect on platforms other than MS-DOS and MS-Windows."

As mentioned below, this entire option (-u) is about to disappear.

> --
> Man page: grep.1
> Issue 1: pcresyntax(3) → B<pcresyntax>(3)
> Issue 2: pcrepattern(3) → B<pcrepattern>(3)
>
> "B<grep> understands three different versions of regular expression syntax: "
> "``basic'' (BRE), ``extended'' (ERE) and ``perl'' (PCRE).  In GNU B<grep> "
> "there is no difference in available functionality between basic and extended "
> "syntaxes.  In other implementations, basic regular expressions are less "
> "powerful.  The following description applies to extended regular "
> "expressions; differences for basic regular expressions are summarized "
> "afterwards.  Perl-compatible regular expressions give additional "
> "functionality, and are documented in pcresyntax(3) and pcrepattern(3), but "
> "work only if PCRE is available in the system."

Done.

> --
> Man page: grep.1
> Issue: Order of entries not according to man-pages(7)
>
> "B(1), B(1), B(1), B(1), B(1), B(1), "
> "B(1), B(1), B(2), B(3), B(3), "
> "B(3), B(5), B(7), B(7)."

I read man-pages(7)'s section on SEE ALSO and found only one nit here.
It terminated the list with a period, while man-pages(7) says to
provide no period.

> The list should be ordered by section number and then alphabetically by
> name.  Do not terminate this list with a period.

If there's something else, please be more precise.

Actually, for each suggestion in the future, please provide an actual
diff. That would be far better.

> --
> Man page: grep.1
> Issue: PATTERNS → I<PATTERNS>
>
> "Interpret PATTERNS as Perl-compatible regular expressions (PCREs).  This "
> "option is experimental when combined with the B<-z> (B<-\\^-null-data>)  "
> "option, and B<grep> may warn of unimplemented features."
> --
> Man page: grep.1
> Issue: Not in "grep --help"; remove?
>
> "B<-y>"

This is deliberate. Undocumenting is a necessary step prior to removal
of legacy options like this:

  $ git grep -e -y doc|tail -1
  doc/grep.texi:@option{-y} is an obsolete synonym that is provided for compatibility.

> --
> Man page: grep.1
> Issue: Not in --help; is it a valid option?
>
> "B<-u>, B<-\\^-unix-byte-offsets>"

Similarly,
      case 'u':
        /* Obsolete option; it has no effect.  FIXME: Diagnose use of
           this option starting in (say) the year 2020.  */
        break;

I'm doing as that suggests in a separate commit.

> --
> Man page: grep.1
> Issue: grep should be in B<>
>
> "Read all files under each directory, recursively, following symbolic links "
> "only if they are on the command line.  Note that if no file operand is "
> "given, grep searches the working directory.  This is equivalent to the B<-d "
> "recurse> option."

Done.

I wrote the attached patch in your name.
Let me know if there's anything else to be done here.


0001-doc-adjust-man-page-syntax.patch
Description: Binary data

bug#44888: [minor] [french translation] Help mentions "NONBRE" -- should be "NOMBRE"

2020-12-15 Thread Jim Meyering

> I fixed this bug and the last translation is available on the
> Translation Project for developers.

Merci, Stéphane.

bug#44888: [minor] [french translation] Help mentions "NONBRE" -- should be "NOMBRE"

2020-12-05 Thread Jim Meyering

On Thu, Nov 26, 2020 at 11:19 AM Jim Meyering  wrote:
> On Thu, Nov 26, 2020 at 5:17 AM Niols  wrote:
> > Using GNU grep version 3.6 on Archlinux.
> >
> > The french translation of --help message contains:
> >
> >  -NONBRE   identique à --context=NONBRE
> >
> > as the translation of:
> >
> >  -NUM  same as --context=NUM
> >
> > in the "Context control" section. This is a mistake and the correct
> > french word is "NOMBRE" with an "M". Funnily enough, the translation is
> > correct for the --max-count option.
> >
> > I am not sure if I should be mailing this address or directly the
> > translation project. Sorry for the noise if I was mistaken.
>
> Merci,
> Let's Cc the Language Team mentioned in that .po file: tra...@traduc.org
> Once someone fixes it, please close this bug.

Trying again...
Can someone make this trivial change?
i.e., can someone run this one-liner on grep's fr.po file?

  perl -pi -e s/NONBRE/NOMBRE/g fr.po

bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6

2020-12-05 Thread Jim Meyering

On Thu, Dec 3, 2020 at 12:26 AM Norihiro Tanaka  wrote:
> On Thu, 26 Nov 2020 21:41:20 -1000
> Jim Meyering  wrote:
>
> > On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering  wrote:
> > >
> > > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering  wrote:
> > > > Thank you for the fine bug report.
> > > > The grep-3.6 bug you've exposed is due to the fact that your input
> > > > triggers excessive hash collisions when using the code modeled after
> > > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > > > take O(N^2) time for N patterns. In the attached, I've switched grep
> > > > to use the djb2 hash function, and that resolves the problem. I'll
> > > > also add a NEWS entry and a test before pushing this.
> > >
> > > Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> > > Here's an example that would take 2-3 days with grep-3.6 and only
> > > seconds with this fix:
> > >
> > >   : | grep -Ff <(seq 640 | tr 0-9 A-J)
> > >
> > > Here's a complete patch.
> > > I'll push it later today.
> >
> > Pushed along with two gnulib-related changes.
>
> The fix has improved some performance.  However, it's still quite slow
> compared to version 3.3, and that can be remedied.
>
> It converts to grep only if the potential match does not match the word
> frequently.

Thank you for that patch. Can you say a little more about the domain
of the problem?
I.e., is it specific to invocations with "-w"?
Can you provide an example that exhibits the performance improvement,
with timings?

bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6

2020-11-26 Thread Jim Meyering

On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering  wrote:
>
> On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering  wrote:
> > Thank you for the fine bug report.
> > The grep-3.6 bug you've exposed is due to the fact that your input
> > triggers excessive hash collisions when using the code modeled after
> > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > take O(N^2) time for N patterns. In the attached, I've switched grep
> > to use the djb2 hash function, and that resolves the problem. I'll
> > also add a NEWS entry and a test before pushing this.
>
> Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> Here's an example that would take 2-3 days with grep-3.6 and only
> seconds with this fix:
>
>   : | grep -Ff <(seq 640 | tr 0-9 A-J)
>
> Here's a complete patch.
> I'll push it later today.

Pushed along with two gnulib-related changes.

bug#44888: [minor] [french translation] Help mentions "NONBRE" -- should be "NOMBRE"

2020-11-26 Thread Jim Meyering

On Thu, Nov 26, 2020 at 5:17 AM Niols  wrote:
> Using GNU grep version 3.6 on Archlinux.
>
> The french translation of --help message contains:
>
>  -NONBRE   identique à --context=NONBRE
>
> as the translation of:
>
>  -NUM  same as --context=NUM
>
> in the "Context control" section. This is a mistake and the correct
> french word is "NOMBRE" with an "M". Funnily enough, the translation is
> correct for the --max-count option.
>
> I am not sure if I should be mailing this address or directly the
> translation project. Sorry for the noise if I was mistaken.

Merci,
Let's Cc the Language Team mentioned in that .po file: tra...@traduc.org
Once someone fixes it, please close this bug.

bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6

2020-11-26 Thread Jim Meyering

On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering  wrote:
> Thank you for the fine bug report.
> The grep-3.6 bug you've exposed is due to the fact that your input
> triggers excessive hash collisions when using the code modeled after
> gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> take O(N^2) time for N patterns. In the attached, I've switched grep
> to use the djb2 hash function, and that resolves the problem. I'll
> also add a NEWS entry and a test before pushing this.

Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
Here's an example that would take 2-3 days with grep-3.6 and only
seconds with this fix:

  : | grep -Ff <(seq 640 | tr 0-9 A-J)

Here's a complete patch.
I'll push it later today.


0001-grep-avoid-performance-regression-with-many-patterns.patch
Description: Binary data

bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6

2020-11-25 Thread Jim Meyering

On Thu, Nov 19, 2020 at 7:32 PM Frank Heckenbach
 wrote:
> I have a use case where I run grep with a large number of search
> patterns on a large text file. It works well with grep-3.3, but with
> grep-3.4 it quickly burned through GBs of memory and almost locked
> up my system due to swapping.
>
> To avoid attaching those large files, I could mostly reproduce the
> effects like this:
>
> ulimit -d 500  # avoid system lockup due to excessive swapping
> export LC_ALL=C  # make sure no Unicode case conversions are needed
>
> % time  ./grep-3.3 -Fwif <(seq 30 | tr 0-9 A-J) <<
> real    0m0.054s
> user    0m0.048s
> sys     0m0.012s
>
> % time  ./grep-3.4 -Fwif <(seq 3 | tr 0-9 A-J) <<
> ./grep-3.4: Memory exhausted
> Aborted
>
> real    0m1.291s
> user    0m0.696s
> sys     0m0.599s
>
> % time  ./grep-3.6 -Fwif <(seq 30 | tr 0-9 A-J) <<
> real    0m13.162s
> user    0m12.955s
> sys     0m0.211s
>
> grep-3.3 behaves well, even with much larger number of patterns.
> Time seems to grow linearly, and memory usage is constant.
>
> grep-3.4 behaves the worst of these 3 versions. Even with just 3
> patterns it exceeds the ulimit of 5 GB.
>
> grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to
> be quadratic in the number of patterns, and though memory usage in
> this case seems to be almost constant, in my actual use case it also
> runs out of memory where grep-3.3 works well with just a few 100 MB
> used.
>
> Without "-i", grep-3.4 seems to run as fast as grep-3.3, but
> grep-3.6 is almost as slow as with "-i".
>
> So there might actually be two different issues here, one that
> affects 3.4 with "-i" and one that affects 3.6 with or without "-i".

Thank you for the fine bug report.
The grep-3.6 bug you've exposed is due to the fact that your input
triggers excessive hash collisions when using the code modeled after
gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
take O(N^2) time for N patterns. In the attached, I've switched grep
to use the djb2 hash function, and that resolves the problem. I'll
also add a NEWS entry and a test before pushing this.


0001-grep-avoid-performance-regression-with-many-patterns.patch
Description: Binary data

bug#44535: grep-3.6 released [stable]

2020-11-13 Thread Jim Meyering

On Thu, Nov 12, 2020 at 11:40 PM Paul Eggert  wrote:
> On 11/9/20 8:12 AM, Andreas Schwab wrote:
> > grep 3.6 fails to build:
> >
> > test-nl_langinfo-mt.c: In function 'threadN_func':
> > test-nl_langinfo-mt.c:185:1: error: no return statement in function returning non-void [-Werror=return-type]
> >185 | }
> >| ^
> > cc1: some warnings being treated as errors
> > make[4]: *** [Makefile:4221: test-nl_langinfo-mt.o] Error 1
>
> We have dueling compilers here, as Sun C complains if the return statements
> are present[1], whereas gcc -Wreturn-type complains if they're absent. Since
> the return statements are clearly bogus and unnecessary I'm inclined to
> continue to omit them.
>
> > https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:Factory/grep/f/x86_64
>
> This says you're configuring with CFLAGS='... -Werror=return-type ...'. If you
> omit the "-Werror=return-type" option the problem should go away. For 'grep',
> that option is more trouble than it's worth. (Perhaps someone should file a
> GCC bug report)
>
> For the recommended set of warning options for compiling 'grep', you can use
> './configure --enable-gcc-warnings' instead.
>
> [1]
> https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=bd90572c031a25e559907ae0c2b9fd3aa632893b

This led me to realize that grep had not enabled warnings on its
compilation of gnulib-tests/.
I've begun to fix that, which has included filing this gcc bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97817

bug#44535: grep-3.6 released [stable]

2020-11-09 Thread Jim Meyering

On Mon, Nov 9, 2020 at 8:14 AM Andreas Schwab  wrote:
> grep 3.6 fails to build:
>
> test-nl_langinfo-mt.c: In function 'threadN_func':
> test-nl_langinfo-mt.c:185:1: error: no return statement in function returning non-void [-Werror=return-type]
>   185 | }
>   | ^
> cc1: some warnings being treated as errors
> make[4]: *** [Makefile:4221: test-nl_langinfo-mt.o] Error 1

Thanks for the report.
Please also tell us which compiler you're using.
Note that for most users this would only be a warning.
In order to make it a build-blocking error, you must have run
configure with --enable-gcc-warnings.

bug#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to check syntax

2020-11-01 Thread Jim Meyering

On Mon, Oct 12, 2020 at 4:08 PM Jim Meyering  wrote:
> On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka  wrote:
> >
> > We can set RE_NO_SUB for calling regex only to check syntax.  It brings
> > performance gains in cases to have a lot of enormous epsilon nodes.
> >
> >
> > $ printf '(%02d)\n' | sed 's/0/|/g' >pat
> >
> > (before)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 6.15
> > user 4.62
> > sys 1.52
> >
> > (after)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 0.66
> > user 0.19
> > sys 0.46
>
> Thank you.
>
> FYI, when running similar commands with and without your patch (with
> an eye to adding a test), I ran this one (with your patch). It shows
> that using 80,000 terms caused grep to consume 32GB of memory before
> being OOM-killed:
>
> $ printf '(%08d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
> Command terminated by signal 9
> 6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata
> 32024460maxresident)k
> 6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
> [Exit 137 (KILL)]
>
> I will come back to this later this week.

We must accept the fact that extreme regular expressions will cause
resource exhaustion like that when processed by classical regex_*
functions. This is yet another good reason to prefer PCRE and to use
grep's -P option. In that case, it fails like this:

$ printf '(%08d)\n' | sed 's/0/|/g' |grep -Pf- /dev/null
grep: regular expression is too large

I have just pushed your patch, but without adding a test.
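
The idea behind the patch, compiling a pattern purely to validate its syntax while telling the regex library that no subexpression offsets will be needed, has a close analogue in the POSIX interface. The patch itself sets RE_NO_SUB on the GNU regex side; the sketch below instead uses the POSIX REG_NOSUB flag with regcomp, which is an assumption made for illustration, as is the function name ere_is_valid. It is not the code that was pushed.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if PATTERN compiles as a POSIX extended regular
   expression.  REG_NOSUB tells regcomp that the caller will never ask
   for subexpression match offsets, which suits a syntax-only check.
   Hypothetical helper, not part of grep.  */
bool
ere_is_valid (char const *pattern)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;               /* nothing to free on failure */
  regfree (&re);
  return true;
}
```

A caller would use this only as a gate, e.g. rejecting "(a" (unbalanced parenthesis) while accepting "(a|b)+c", and would compile the pattern again with whatever flags the real matcher needs.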

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 10:03 AM Paul Eggert  wrote:
> Thanks to all for the bug report and quick fix. Closing the bug report.

Thanks for closing that. I've pushed the gnulib changes and am about
to push those for grep, too.

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 8:05 AM Jim Meyering  wrote:
> On Sun, Nov 1, 2020 at 8:02 AM Jim Meyering  wrote:
> >
> > On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka  wrote:
> > > Hi,
> > > By the way, I was wondering whether to add the test to ere.tests or
> > > spencer1.tests or to a new file.  How should they be used properly?
> >
> > Adding the new test in either place is fine, but there should be a comment.
> >
> > Also, we need a NEWS entry. I'll add that separately.
>
> My mention of NEWS here is ambiguous.
> When I wrote that, I was thinking of grep's NEWS file.
> Just after sending, I realized I must also mention the fix in gnulib's
> NEWS file.
> Amending shortly.

I hadn't read gnulib's NEWS file in too long. It's for things like API
changes, not bug fixes.

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 8:02 AM Jim Meyering  wrote:
>
> On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka  wrote:
> > Hi,
> > By the way, I was wondering whether to add the test to ere.tests or
> > spencer1.tests or to a new file.  How should they be used properly?
>
> Adding the new test in either place is fine, but there should be a comment.
>
> Also, we need a NEWS entry. I'll add that separately.

My mention of NEWS here is ambiguous.
When I wrote that, I was thinking of grep's NEWS file.
Just after sending, I realized I must also mention the fix in gnulib's
NEWS file.
Amending shortly.

> More importantly, there must be a test in gnulib. I'm adding one with
> the attached.

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka  wrote:
> Hi,
> By the way, I was wondering whether to add the test to ere.tests or
> spencer1.tests or to a new file.  How should they be used properly?

Adding the new test in either place is fine, but there should be a comment.

Also, we need a NEWS entry. I'll add that separately.

More importantly, there must be a test in gnulib. I'm adding one with
the attached.

0001-dfa-tests-test-for-invalid-merge-fix.patch
Description: Binary data

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 7:19 AM Jim Meyering  wrote:
>
> On Sun, Nov 1, 2020 at 12:42 AM Norihiro Tanaka  wrote:
> > Example,
> >
> >   a+a+a
> >   1 2 3
> >
> > position 1 has a repetition of "a" and other transition with "a".
> > position 2 has a repetition of "a" and other transition with "a", too.
> > Then DFA was merging the two nodes, but it is wrong.
> >
> > Now similar nodes in series are not merged.
>
> Thank you for the quick work.
> Would you please send a revised test patch? That one appears to be a
> tiny delta added to a file that only you have locally. I.e., it
> requires the new file tests/ere.tests, with 200+ lines.

Oops. My mistake. I thought you were adding the test in gnulib and
had added a new framework there.
The tests/ere.tests file is in grep. No problem. Thanks again.

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-11-01 Thread Jim Meyering

On Sun, Nov 1, 2020 at 12:42 AM Norihiro Tanaka  wrote:
> Example,
>
>   a+a+a
>   1 2 3
>
> position 1 has a repetition of "a" and other transition with "a".
> position 2 has a repetition of "a" and other transition with "a", too.
> Then DFA was merging the two nodes, but it is wrong.
>
> Now similar nodes in series are not merged.

Thank you for the quick work.
Would you please send a revised test patch? That one appears to be a
tiny delta added to a file that only you have locally. I.e., it
requires the new file tests/ere.tests, with 200+ lines.

bug#44352: Incorrect matches for some ERE

2020-10-31 Thread Jim Meyering

On Sat, Oct 31, 2020 at 9:18 AM Gonzalo Padrino  wrote:
>   While using GNU grep v3.4 in an Ubuntu 20.04 userspace running on top of
> Win10 WSL (yeah, i know... but also checked in other envs) i discovered
> what seems like an obvious bug (if i'm not mistaken).
>
>   The bug:
> -
> me@host:~$  echo 'y' |grep -E '^x+x+x+x+y$'
> y
> me@host:~$  echo 'xxxy' |grep -E '^x+x+x+x+y$'
> xxxy
> me@host:~$  echo 'xxy' |grep -E '^x+x+x+x+y$'
> xxy
> me@host:~$  echo 'xy' |grep -E '^x+x+x+x+y$'
>
> 
> ...the terminal supports ansi color escapes, and what's really weird is
> that only the result from the first command is colored in red. First and
> fourth commands yield correct results; the second and third do not, as they
> should not match its input.
>
>   I've tested releases from v3.1 to latest v3.5 and found the anomalous
> behaviour in version v3.2 through v3.5. A (quick and clunky) git bisect led
> me to believe it was introduced about two years ago, possibly in commit
> 123620af88f55c3e0cc9f0aed7311c72f625bc82 (
> https://git.savannah.gnu.org/cgit/grep.git/commit/?id=123620af88f55c3e0cc9f0aed7311c72f625bc82
> ).
> If this is true, it would mean either the bug is in gnulib, or maybe grep
> needed to do some kind of extra handling on its side.
>
> Kind regards. Gonzalo Padrino.
>
> P.S.: I had to patch some things in order to successfully compile the code
> after checking out some problematic commits (pragmas to avoid warnings
> about "pure" and "noreturn" function attributes, a missing configmake
> dependency in bootstrap.conf, etc ).
>
> P.S.: Resending message since first got lost in aether apparently.

Thanks. It was not lost. Conversation is continuing on
https://bugs.gnu.org/44351

bug#44351: Bug in grep v3.2 onwards in regular expression matching

2020-10-31 Thread Jim Meyering

On Sat, Oct 31, 2020 at 9:17 AM Gonzalo Padrino
 wrote:
> While using GNU grep v3.4 in an Ubuntu 20.04 userspace running on top of
> Win10 WSL (yeah, i know... but also checked in other envs) i discovered
> what seems like an obvious bug (if i'm not mistaken).
>   The bug:
> -
> me@host:~$  echo 'y' |grep -E '^x+x+x+x+y$'
> y
> me@host:~$  echo 'xxxy' |grep -E '^x+x+x+x+y$'
> xxxy
> me@host:~$  echo 'xxy' |grep -E '^x+x+x+x+y$'
> xxy
> me@host:~$  echo 'xy' |grep -E '^x+x+x+x+y$'
>
> 
> ...the terminal supports ansi color escapes, and what's really weird is
> that only the result from the first command is colored in red. First and
> fourth commands yield correct results; the second and third do not, as they
> should not match its input.
>
>   I've tested releases from v3.1 to latest v3.5 and found the anomalous
> behaviour in version v3.2 through v3.5. A (quick and clunky) git bisect led
> me to believe it was introduced about two years ago, possibly in commit
> 123620af88f55c3e0cc9f0aed7311c72f625bc82 (
> https://git.savannah.gnu.org/cgit/grep.git/commit/?id=123620af88f55c3e0cc9f0aed7311c72f625bc82).
> If this is true, it would mean either the bug is in gnulib, or maybe grep
> needed to do some kind of extra handling on its side.

Thank you for reporting that. I confirm this is a bug in the very latest.
This mistakenly matches:
  $ echo xxy |grep -E '^x+x+x+y$'
  xxy

That regular expression requires that any match have at least three
leading 'x's.
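
One way to see the expected behaviour is to compare against an equivalent
pattern. This is only a sketch, assuming a POSIX shell: `^x{3,}y$` is my own
restatement of `^x+x+x+y$`, and `check` is a hypothetical helper, not part
of grep's test suite.

```shell
# Each x+ consumes at least one 'x', so '^x+x+x+y$' should accept
# exactly the same lines as '^x{3,}y$' (an equivalent restatement).
check() {
  for s in y xy xxy xxxy xxxxy; do
    if printf '%s\n' "$s" | grep -Eq "$1"; then
      echo "$s: match"
    else
      echo "$s: no match"
    fi
  done
}
check '^x{3,}y$'
```

With a fixed grep, running `check '^x+x+x+y$'` should print the same five
lines; with an affected grep, the xxy line would differ.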

This is indeed due to a bug in gnulib's lib/dfa.c.

So far, I've found that we can band-aid fix it by disabling part of
merge_nfa_state's optimizations with this patch, but I do not propose
to make this change. This is just to show where the problem lies. I'm
pretty sure we can retain and correct the optimization.

diff --git a/lib/dfa.c b/lib/dfa.c
index 74aafa2ee..087c266c5 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -2459,7 +2459,7 @@ merge_nfa_state (struct dfa *d, idx_t tindex, char *flags,
 continue;

   if (flags[sindex] & OPT_REPEAT)
-delete (sindex, &follows[sindex]);
+continue;

   merge2 (&follows[dindex], &follows[sindex], merged);

bug#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to check syntax

2020-10-12 Thread Jim Meyering

On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka  wrote:
>
> We can set RE_NO_SUB for calling regex only to check syntax.  It brings
> performance gains in cases to have a lot of enormous epsilon nodes.
>
>
> $ printf '(%02d)\n' | sed 's/0/|/g' >pat
>
> (before)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 6.15
> user 4.62
> sys 1.52
>
> (after)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 0.66
> user 0.19
> sys 0.46

Thank you.

FYI, when running similar commands with and without your patch (with
an eye to adding a test), I ran this one (with your patch). It shows
that using 80,000 terms caused grep to consume 32GB of memory before
being OOM-killed:

$ printf '(%08d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
Command terminated by signal 9
6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata 32024460maxresident)k
6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
[Exit 137 (KILL)]

I will come back to this later this week.
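
When experimenting with inputs like this, it may be worth capping memory so
the process fails quickly instead of waiting for the OOM killer. A rough
sketch, assuming a shell whose `ulimit` supports `-v` (bash and dash do);
`run_capped` is a hypothetical helper, not something in grep's test suite.

```shell
# run_capped KIB CMD...: run CMD in a subshell whose virtual memory
# is limited to KIB kibibytes, so a runaway allocation fails early.
run_capped() {
  limit_kib=$1
  shift
  ( ulimit -v "$limit_kib" && exec "$@" )
}

# Example: allow about 4 GiB of address space for a small search.
printf 'a\nb\n' | run_capped 4194304 grep -c a
```

The limit applies only inside the subshell, so the calling shell keeps its
own settings.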

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-26 Thread Jim Meyering

On Tue, Sep 22, 2020 at 8:04 PM Norihiro Tanaka  wrote:
> On Tue, 22 Sep 2020 16:25:06 -0700
> Jim Meyering  wrote:
>
> > Oh! Good timing. I was about to make a new snapshot.
> > Do you happen to have a test case handy that demonstrates the failure?
>
> I added test case to previous patch.
>
> By the way, I found the following bug in making the test case, and it's
> still left.
>
> $ env LC_ALL=tr_TR.utf8 grep -Fio i in
> Aborted (core dumped)
>
> (gdb) bt
> #0  0x003b8d032495 in raise () from /lib64/libc.so.6
> #1  0x003b8d033c75 in abort () from /lib64/libc.so.6
> #2  0x0040cdde in kwsinit (mb_trans=true) at searchutils.c:64
> #3  0x00409624 in Fcompile (pattern=0x23c1240 "i\n", size=1, ignored=0, exact=true) at kwsearch.c:56
> #4  0x00409378 in main (argc=4, argv=0x7ffe76048388) at grep.c:2977

Using the latest sources plus that patch, I ran all of the tests with
an assertion that kwsearch->exact == !!start_ptr before each use of
that new member, and the assertion never failed.
As far as I can see, this patch is not necessary (also, I could not
reproduce your abort), so I'm closing this issue. Please reopen if you
can demonstrate its utility.

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-23 Thread Jim Meyering

On Wed, Sep 23, 2020 at 8:16 PM Paul Eggert  wrote:
> On 9/23/20 8:00 PM, Jim Meyering wrote:
> > Thank you, I expect to push it shortly, along with a gnulib-sync diff,
> > to pull in Paul's regex fixes.
>
> Ouch, it looks like we had dueling commits prepared, as I read your email just
> after pushing a more-extensive patch.

I noticed, but it wasn't a problem.
Thanks for all the work (from both you and Norihiro) that went into
those changes.

> I looked at Norihiro's recent "grep: fix a bug in the previous commit" patch
> <https://bugs.gnu.org/43527#28>. Although the test case added by that patch
> exposed a bug in Savannah grep before I installed the "grep: fix more
> Turkish-eyes bugs" patch just now, that test case works with current grep
> master on Savannah (commit 8577dda638ebfee2b77342a4d07252745ec42a3a). This
> isn't surprising, as the "grep: fix more Turkish-eyes bugs" patch tests the
> same thing plus some more stuff.
>
> It'd be good to have a different test case to demonstrate why the "grep: fix a
> bug in the previous commit" patch is needed to kwsearch.c. I'll take a look 
> at that.

I agree. Hoping to make the next snapshot soon, but not before
tomorrow (Thu) evening.

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-23 Thread Jim Meyering

On Tue, Sep 22, 2020 at 8:04 PM Norihiro Tanaka  wrote:
> On Tue, 22 Sep 2020 16:25:06 -0700
> Jim Meyering  wrote:
>
> > Oh! Good timing. I was about to make a new snapshot.
> > Do you happen to have a test case handy that demonstrates the failure?
>
> I added test case to previous patch.

Thank you, I expect to push it shortly, along with a gnulib-sync diff,
to pull in Paul's regex fixes.

> By the way, I found the following bug in making the test case, and it's
> still left.
>
> $ env LC_ALL=tr_TR.utf8 grep -Fio i in
> Aborted (core dumped)
>
> (gdb) bt
> #0  0x003b8d032495 in raise () from /lib64/libc.so.6
> #1  0x003b8d033c75 in abort () from /lib64/libc.so.6
> #2  0x0040cdde in kwsinit (mb_trans=true) at searchutils.c:64
> #3  0x00409624 in Fcompile (pattern=0x23c1240 "i\n", size=1, ignored=0, exact=true) at kwsearch.c:56
> #4  0x00409378 in main (argc=4, argv=0x7ffe76048388) at grep.c:2977

Please tell us what is in your input file named "in" and what type of
system you're using.
If I guess it's the "in" file from the turkish-eyes test, and try the
following (using both your and Paul's patches), I see no failure:

$ i=$(printf '\304\261') I=$(printf '\304\260'); data="I:$I $i:i";
echo "$data" > in; env LC_ALL=tr_TR.utf8 src/grep -Fio i in
İ
i
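
For anyone unfamiliar with the octal escapes in that command: they spell out
the UTF-8 encodings of the two Turkish letters. A small sketch using only
`printf`, `od` and `tr`; the helper name `hex_of` is mine.

```shell
# hex_of ESCAPED-STRING: show the bytes printf emits, in hex.
hex_of() {
  printf "$1" | od -An -tx1 | tr -d ' \n'
  echo
}
hex_of '\304\260'   # U+0130, capital I with dot above: prints c4b0
hex_of '\304\261'   # U+0131, dotless small i: prints c4b1
```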

bug#43577: wrong result for grep -io in turkish locale

2020-09-23 Thread Jim Meyering

On Wed, Sep 23, 2020 at 6:24 AM Norihiro Tanaka  wrote:
>
> In turkish locale, upper and lower case are mapped as following.
>
>   U0049 <-> U0131
>   U0069 <-> U0130
>
> It's expected that both following test cases returns U0130, but later
> returns nothing.
>
> $ printf '\304\260\n' >I  # U0130
> $ env LC_ALL=tr_TR.utf8 grep -i i I
> ?  # U0130

Oh! We must have different code or systems.
When I run anything using -i and that locale on Fedora 32, it aborts:

$ LC_ALL=tr_TR.utf8 src/grep -i a
zsh: abort (core dumped)  LC_ALL=tr_TR.utf8 src/grep -i a

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-22 Thread Jim Meyering

On Tue, Sep 22, 2020 at 3:57 PM Norihiro Tanaka  wrote:
> On Tue, 22 Sep 2020 08:50:03 -0700
> Jim Meyering  wrote:
>
> > On Tue, Sep 22, 2020 at 7:54 AM Norihiro Tanaka  wrote:
> > > On Mon, 21 Sep 2020 17:33:25 -0700
> > > Jim Meyering  wrote:
> > ...
> > > > Here are the two patches (tested on top of a third that updates to
> > > > latest gnulib). I'll await an 'ok' from Norihiro Tanaka before
> > > > pushing, since commit-log metadata is essentially immutable once
> > > > pushed.
> > >
> > > Great, thank you.  I confirmed it.
> >
> > Thanks. Pushed.
>
> Oh, I found a bug for this fix.  If Fexecute is called first without
> start_ptr and next with start_ptr, it may break.

Oh! Good timing. I was about to make a new snapshot.
Do you happen to have a test case handy that demonstrates the failure?

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-22 Thread Jim Meyering

On Tue, Sep 22, 2020 at 7:54 AM Norihiro Tanaka  wrote:
> On Mon, 21 Sep 2020 17:33:25 -0700
> Jim Meyering  wrote:
...
> > Here are the two patches (tested on top of a third that updates to
> > latest gnulib). I'll await an 'ok' from Norihiro Tanaka before
> > pushing, since commit-log metadata is essentially immutable once
> > pushed.
>
> Great, thank you.  I confirmed it.

Thanks. Pushed.

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-21 Thread Jim Meyering

On Sun, Sep 20, 2020 at 6:34 PM Jim Meyering  wrote:
>
> On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka  wrote:
> > Hi,
> > > Performance for the following case was fixed in bug#43040.
> >
> >   $ yes 0 | head -10 | sed '$s/././' >pat
> >   $ grep -vf pat /dev/null
> >
> > However, still slow and a lot of memory wasted for the following cases.
> >
> >   $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words
> >
> > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7.
> > Though it's an optimization for patterns with backreferences, it seems
> > to cause performance degradation in many cases due to regex
> > implementation issues.
> >
> > > grep needs the regex engine when a pattern is not supported by the DFA
> > > engine, and when either the only-matching (-o) or the color (--color)
> > > option is given.
> > >
> > > In other words, if none of these conditions hold, grep uses regex only to
> > > check the syntax.  With this patch, grep avoids compiling the regex in
> > > that case.
>
> Yikes. Thank you!
> That exposes (and fixes in this common case) a problem that makes grep
> require memory that is quadratic in the number of regular expressions.
>
> To illustrate, I ran some timings.
> With only 80,000 lines of /usr/share/dict/linux.words, the following
> would use 100GB of RSS and take 3 minutes. With the fix, it used less
> than 400MB and took less than one second.
>
>   head -$N /usr/share/dict/linux.words > w; grep -vf w w
>
> N   Mem(k): Old        New
> 2   6341188 (2.4s)     103168
> 4   25241288 (9.29s)   199188 (0.31s)
> 8   100547432 (180s)   392872 (0.66s)
>
> I've just pushed the gnulib-adjusting patch and will push the other soon.
> I'll also add a test and a NEWS item in a separate patch.

Here are the two patches (tested on top of a third that updates to
latest gnulib). I'll await an 'ok' from Norihiro Tanaka before
pushing, since commit-log metadata is essentially immutable once
pushed.


0002-tests-test-for-many-regexp-N-2-RSS-regression.patch
Description: Binary data


0001-grep-avoid-unnecessary-regex-compilation.patch
Description: Binary data

bug#43527: [PATCH] grep: avoid unneeded compilation of regex

2020-09-20 Thread Jim Meyering

On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka  wrote:
> Hi,
> Performance for the following case was fixed in bug#43040.
>
>   $ yes 0 | head -10 | sed '$s/././' >pat
>   $ grep -vf pat /dev/null
>
> However, still slow and a lot of memory wasted for the following cases.
>
>   $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words
>
> This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7.
> Though it's an optimization for patterns with backreferences, it seems
> to cause performance degradation in many cases due to regex
> implementation issues.
>
> grep needs the regex engine when a pattern is not supported by the DFA
> engine, and when either the only-matching (-o) or the color (--color)
> option is given.
>
> In other words, if none of these conditions hold, grep uses regex only to
> check the syntax.  With this patch, grep avoids compiling the regex in
> that case.

Yikes. Thank you!
That exposes (and fixes in this common case) a problem that makes grep
require memory that is quadratic in the number of regular expressions.

To illustrate, I ran some timings.
With only 80,000 lines of /usr/share/dict/linux.words, the following
would use 100GB of RSS and take 3 minutes. With the fix, it used less
than 400MB and took less than one second.

  head -$N /usr/share/dict/linux.words > w; grep -vf w w

N   Mem(k): Old        New
2   6341188 (2.4s)     103168
4   25241288 (9.29s)   199188 (0.31s)
8   100547432 (180s)   392872 (0.66s)

I've just pushed the gnulib-adjusting patch and will push the other soon.
I'll also add a test and a NEWS item in a separate patch.
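
To try the scaling experiment without a dictionary file, something like the
following may do. It is only a sketch: `make_words` is a hypothetical
stand-in for `head -$N /usr/share/dict/linux.words`, the sizes are
deliberately small, `seq` is assumed to be available, and `date +%s` gives
only whole-second resolution.

```shell
# make_words N: emit N distinct synthetic "words", one per line.
make_words() {
  seq "$1" | sed 's/^/word/'
}

w=$(mktemp)
for n in 1000 2000 4000; do
  make_words "$n" > "$w"
  start=$(date +%s)
  # Every line matches at least one pattern (itself), so -v selects nothing.
  count=$( { grep -vf "$w" "$w" || :; } | wc -l)
  end=$(date +%s)
  echo "N=$n non-matching=$((count)) seconds=$((end - start))"
done
rm -f "$w"
```

Peak memory is the more telling number here; wrapping the grep call in a
tool that reports maximum resident set size would show it.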

bug#33552: bug#29668: grep patches for "Binary file FOO matches" glitches

2020-09-18 Thread Jim Meyering

On Thu, Sep 17, 2020 at 7:59 PM Paul Eggert  wrote:
> On 9/17/20 3:03 PM, Jim Meyering wrote:
> > The alternative is to change that "B" to a "b", which should be fine,
> > now that it's only emitted to stderr.
>
> Makes sense.
>
> NEWS should be updated accordingly - but when I looked into doing that I
> came up with the attached more-elaborate patch, which changes this new
> diagnostic and two other unusual-format diagnostics, so that they use the
> same "grep: FILENAME: MESSAGE" form that grep uses everywhere else.
> Whaddya think?

Nice. Dropping the quote module (even if negligible size delta) is a
fine side effect. You're welcome to push that.
Thanks!
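
For reference, the usual "grep: FILENAME: MESSAGE" form can be seen with any
unreadable file. A small illustration, assuming grep is invoked as `grep`;
the path is made up, and the text after the second colon comes from the C
library.

```shell
# Ask grep to read a file that does not exist and capture stderr only.
missing=/nonexistent-dir/no-such-file
msg=$(LC_ALL=C grep pattern "$missing" 2>&1 >/dev/null || :)
echo "$msg"
# The diagnostic starts with the program name, then the file name.
case $msg in
  "grep: $missing: "*) echo "uses the grep: FILENAME: MESSAGE form" ;;
  *) echo "unexpected form" ;;
esac
```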

bug#33552: grep patches for "Binary file FOO matches" glitches

2020-09-17 Thread Jim Meyering

On Thu, Sep 17, 2020 at 11:46 AM Paul Eggert  wrote:
> Attached are two related 'grep' patches, one prompted by Bug#33552
> "Possible bug with handling -I option" and the other by Bug#29668 "grep:
> Fatal problem with (big) file". Although I'd normally install these on grep
> master, Jim has started the ball rolling on the next grep release so I'll
> cc this to him to see whether these patches can be squeezed in before the
> next release.

Nice! Thank you for resolving those.
The first one did indeed simplify numerous tests.
Both look fine and seem uncontroversial, so please go ahead and push them.
I'll probably update to latest gnulib this evening and then make a new snapshot.

bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.

2020-09-14 Thread Jim Meyering

On Sun, Sep 13, 2020 at 7:03 PM Paul Eggert  wrote:
> On 9/11/20 11:41 PM, Jim Meyering wrote:
> >> https://bugs.gnu.org/40634#32
> >>
> >> I'll try to take a look at the later patch.
> >
> > Oh! Glad you spotted that.
>
> I took a look and the basic idea sounds good though I admit I did not check
> every detail. While looking into it I found some opportunities for
> improvements, plus I found what appear to be some longstanding bugs in the
> area, one of which causes a grep test failure on Solaris (and I suspect the
> bug is also on GNU/Linux but the grep tests don't catch it). I installed the
> attached patches into Gnulib, updated grep to point to the new Gnulib
> version, and added a note in grep's NEWS file about this.
>
> Patch 1 is what Norihiro Tanaka proposed in Bug#40634#32, except I edited the
> commit message. Patch 2 consists of minor cleanups and performance tweaks for
> Patch 1. (Patches 3 and 4 are omitted as they were installed by others into
> Gnulib at about the same time I was installing these.) Patch 5 fixes a
> dfa-heap-overrun failure on Solaris that appears to be a longstanding bug
> exposed by Patch 1 when running on Solaris. Patch 6 merely cleans up code near
> Patch 5. Patch 7 fixes the use of an uninitialized constraint, which I
> discovered while debugging Patch 5 under Valgrind; this also appears to be a
> longstanding bug.
>
> Coming up with test cases for all these bugs would be pretty tricky, 
> unfortunately.

Wow! Thank you!

bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.

2020-09-12 Thread Jim Meyering

On Sat, Sep 12, 2020 at 1:01 AM Paul Eggert  wrote:
> > And here is the adjusted patch:
>
> Hold on, that looks like a cleanup of the April 18 patch posted here:
>
> https://bugs.gnu.org/40634#26
>
> But there's a later patch dated April 19, which Norihiro Tanaka said should be
> more correct and simpler:
>
> https://bugs.gnu.org/40634#32
>
> I'll try to take a look at the later patch.

Oh! Glad you spotted that.

bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.

2020-09-11 Thread Jim Meyering

On Fri, Sep 11, 2020 at 2:47 PM Jim Meyering  wrote:
> On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka  wrote:
> > On Sun, 19 Apr 2020 07:41:49 +0900
> > Norihiro Tanaka  wrote:
> > > On Sat, 18 Apr 2020 00:22:26 +0900
> > > Norihiro Tanaka  wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > > Norihiro Tanaka  wrote:
> > > >
> > > > >
> > > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > > Norihiro Tanaka  wrote:
> > > > >
> > > > > >
> > > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > > Paul Eggert  wrote:
> > > > > >
> > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > > >
> > > > > > > > > I have had no idea to solve the problem yet.  If we revert it,
> > > > > > > > > bug#33357 will come back.
> > > > > > >
> > > > > > > Yes, I'd rather not revert if we can help it.
> > > > > > >
> > > > > > > My own thought was to not analyze the regular expression if we 
> > > > > > > discover that the input is empty. :-)
> > > > > >
> > > > > > > Now I have an idea: build indexes of the epsilon nodes
> > > > > > > included in follows before removing epsilon nodes.
> > > > >
> > > > >
> > > > > > I wrote a fix for the bug, but it is still slower than grep 2.27.
> > > >
> > > > It was improved previous patch.
> > >
> > > Sorry, correct patch is here.
> >
> > I made the previous patch even simpler.
> >
> > before:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 7.24
> > user 7.14
> > sys 0.09
> >
> > after:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 0.62
> > user 0.52
> > sys 0.10
>
> Thank you for this patch. I have rebased and made minor syntactic changes.
> I'll push it to gnulib soon, if not today, then by Monday.
>
> I am considering creating a test case in grep, but it feels too tight
> to be feasible: I would use a relative perf test, requiring that a
> passing test incur a perf cost of less than say 100x. Here's the
> beginnings of my attempt (note: this is just an outline -- obviously
> would not rely on having "time" in path or as a shell builtin):
>
> gen()
> {
>   local n=$1
>   local i=1
>   while : ; do
>     local pat=$(printf $i | sha1sum | cut -d' ' -f1)
>     printf '%s\n' "$pat$pat(\$|$pat)"
>     i=$(expr $i + 1)
>     test $i = $n && break
>   done
> }
>
> gen 4000 > pats-4000
> head -400 pats-4000 > pats-400
>
> # With the fixed code, a 10x input size increase (n=400 to 4000)
> # induces a 40x runtime increase: .05 -> 2.0s
> # Just prior to this change, it's 150x: 0.2 -> 30s
>
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null

And here is the adjusted patch:


dfa.c-epsilon-node-removal-speedup.patch
Description: Binary data

bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.

2020-09-11 Thread Jim Meyering

On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka  wrote:
> On Sun, 19 Apr 2020 07:41:49 +0900
> Norihiro Tanaka  wrote:
> > On Sat, 18 Apr 2020 00:22:26 +0900
> > Norihiro Tanaka  wrote:
> >
> > >
> > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > Norihiro Tanaka  wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > Norihiro Tanaka  wrote:
> > > >
> > > > >
> > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > Paul Eggert  wrote:
> > > > >
> > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > >
> > > > > > > > I have had no idea to solve the problem yet.  If we revert it,
> > > > > > > > bug#33357 will come back.
> > > > > >
> > > > > > Yes, I'd rather not revert if we can help it.
> > > > > >
> > > > > > My own thought was to not analyze the regular expression if we 
> > > > > > discover that the input is empty. :-)
> > > > >
> > > > > > Now I have an idea: build indexes of the epsilon nodes
> > > > > > included in follows before removing epsilon nodes.
> > > >
> > > >
> > > > > I wrote a fix for the bug, but it is still slower than grep 2.27.
> > >
> > > It was improved previous patch.
> >
> > Sorry, correct patch is here.
>
> I made the previous patch even simpler.
>
> before:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 7.24
> user 7.14
> sys 0.09
>
> after:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 0.62
> user 0.52
> sys 0.10

Thank you for this patch. I have rebased and made minor syntactic changes.
I'll push it to gnulib soon, if not today, then by Monday.

I am considering creating a test case in grep, but it feels too tight
to be feasible: I would use a relative perf test, requiring that a
passing test incur a perf cost of less than say 100x. Here's the
beginnings of my attempt (note: this is just an outline -- obviously
would not rely on having "time" in path or as a shell builtin):

gen()
{
  local n=$1
  local i=1
  while : ; do
    local pat=$(printf $i | sha1sum | cut -d' ' -f1)
    printf '%s\n' "$pat$pat(\$|$pat)"
    i=$(expr $i + 1)
    test $i = $n && break
  done
}

gen 4000 > pats-4000
head -400 pats-4000 > pats-400

# With the fixed code, a 10x input size increase (n=400 to 4000)
# induces a 40x runtime increase: .05 -> 2.0s
# Just prior to this change, it's 150x: 0.2 -> 30s

env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null
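
The relative check itself might look roughly like this. It is a sketch under
assumptions, not the test that was eventually written: `elapsed_ns` and
`ratio_ok` are hypothetical helpers, `date +%s%N` is a GNU extension, and
the two `sleep` calls (fractional arguments are also a GNU extension) merely
stand in for the pats-400 and pats-4000 grep runs.

```shell
# elapsed_ns CMD...: wall-clock nanoseconds taken by CMD.
# Relies on GNU date's %N extension; the command's exit status is ignored.
elapsed_ns() {
  t0=$(date +%s%N)
  "$@" > /dev/null 2>&1 || :
  t1=$(date +%s%N)
  echo $((t1 - t0))
}

# ratio_ok SMALL_NS LARGE_NS LIMIT: succeed if LARGE is under LIMIT x SMALL.
ratio_ok() {
  [ "$2" -lt $(( $1 * $3 )) ]
}

small=$(elapsed_ns sleep 0.01)
large=$(elapsed_ns sleep 0.05)
if ratio_ok "$small" "$large" 100; then
  echo "within 100x"
else
  echo "too slow"
fi
```

Comparing a ratio rather than an absolute time keeps the check less
sensitive to how fast the test machine is, though very short runs remain
noisy.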

bug#18406: O_NOATIME patch

2020-09-04 Thread Jim Meyering

On Tue, Sep 1, 2020 at 4:22 AM Paul Eggert  wrote:
> On 9/11/14 1:13 PM, Paul Eggert wrote:
> > Thanks, but there's no need for that; just have 'grep' complain if the
> > option is used and O_NOATIME == 0.
>
> On looking into this more today, O_NOATIME seems to be just a best-effort
> thing as some GNU/Linux filesystems ignore it, so grep should just join the
> throng and not worry whether O_NOATIME actually works.
>
> Also, the O_NOATIME support was withdrawn from fts a couple of years ago, so
> 'grep -r' can't easily avoid updating atime on directories.
>
> A patch is attached. I'm still of two minds about this. The efficiency
> argument for the new option is not as strong as it used to be, now that
> relatime has taken over on ext4 style filesystems. So the main argument is
> "I want to search through this directory but don't want it to count as an
> access"; although that's indeed a use case I'm not quite sure it's worth
> modifying 'grep' over. It doesn't seem to be worth using up a scarce option
> letter over, anyway, so the attached patch uses just a long option.

I confess to similar ambivalence, but do like the idea. Has anyone run
tests to compare performance on file systems like ext4, btrfs (the
default with Fedora 33) and xfs?
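
Not a performance test, but a quick way to see whether a plain read is
recorded as an access on a given file system. A sketch assuming GNU `stat`
(for `-c %X`); `atime_of` is a hypothetical helper, and the result depends
on mount options such as relatime or noatime.

```shell
# atime_of FILE: last-access time in seconds since the epoch (GNU stat).
atime_of() {
  stat -c %X -- "$1"
}

f=$(mktemp)
echo data > "$f"
before=$(atime_of "$f")
sleep 1
grep -q data "$f"
after=$(atime_of "$f")
if [ "$before" = "$after" ]; then
  echo "atime unchanged"
else
  echo "atime updated"
fi
rm -f "$f"
```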

bug#41004: Documentation:enhancement - search for hexvalue

2020-05-12 Thread Jim Meyering

On Sun, May 10, 2020 at 10:00 AM Stephane Chazelas
 wrote:
>
> 2020-05-01 19:05:28 +0200, radisso...@web.de:
> [...]
> > problem: grep for a character where only the hex code is known.
> >
> > solution:use $'\xNN'
> >  then shell expands this to the required code
> >
> > example:   printf "A\nB\nC\n" | grep $'\x41'
> [...]
>
> The $'\x41' ksh93 quoting operator expands to *byte* values.
>
> To get a character based on the Unicode codepoint value, you'd
> need the $'\u41' zsh operator (or $'\U1' for code points
> above 0x).
>
> But in any case, that is done by the shell, that has nothing to
> do with grep and the syntax of those shell operators varies
> between shells.
>
> In the fish shell you'd use:
>
> grep \u41
>
> or
>
> grep \x41
>
> instead.
>
> Also, since it's done by the shell, things like:
>
> grep $'\u2e'
>
> where U+002E is "FULL STOP", would not only match on "."
> characters but on any character. All grep sees is a "."
> character. That would be different from grep -P '\x2e' which
> matches "." (U+002E) only.
>
> Note that:
>
> grep -P '\xE9'
>
> matches on the byte 0xE9 in singlebyte locales (regardless of
> what character that byte represents in the locale's charset) and
> on character U+00E9 in UTF-8 locales (so the 0xc3 0xa9 sequence
> of bytes, not byte 0xe9).

Thank you for the thorough reply, Stephane!
Bearing that in mind, Radisson, please consider submitting a revised patch.
I suggest to recommend something like this:

$ printf '%s\n' A B C| LC_ALL=C grep -P '\x41'
A

so that the example is independent of both the current locale and the shell.
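
Where `-P` is not available, another shell-independent route is POSIX
`printf`, which understands octal escapes. This is a sketch with one caveat:
it produces a byte, not a Unicode code point. `-F` is used so that the byte
is taken literally rather than as a regular expression.

```shell
# Octal 101 is hexadecimal 41, i.e. the byte for 'A'.
pat=$(printf '\101')
printf '%s\n' A B C | LC_ALL=C grep -F -- "$pat"

# Octal 056 is hexadecimal 2e, a '.'; without -F it matches any character.
dot=$(printf '\056')
printf '%s\n' a . | LC_ALL=C grep -c -- "$dot"      # counts both lines
printf '%s\n' a . | LC_ALL=C grep -c -F -- "$dot"   # counts only the '.' line
```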

bug#41004: Documentation:enhancement - search for hexvalue

2020-05-03 Thread Jim Meyering

On Fri, May 1, 2020 at 10:07 AM  wrote:
> Hi,
> I had the problem of searching for a non-printable character in a long
> list of strings. I found nothing in the documentation, and the several
> discussions of how to do that were either complicated or did not fit my
> case. Maybe I was unlucky; nonetheless, I found a simple solution that
> should be mentioned in the documentation.
>
> problem: grep for a character where only the hex code is known.
>
> solution:use $'\xNN'
>  then shell expands this to the required code
>
> example:   printf "A\nB\nC\n" | grep $'\x41'
>
> note: that uses only printable characters, it works also with anything else
>  except \0 (i guess).
>
> i found that solution nice, it did no require any flags etc, for my problem it
> worked like a charm.
> (i am not member of the list please reply directly to this address) .

Thank you for the suggestion. Another approach is to use grep's -P option:

$ printf '%s\n' A B C| grep -P '\x41'
A

If you'd like to add an example to the documentation, please send a
patch, but I'm not sure how much of PCRE syntax we want to document in
grep's own manual.
