On 6/3/22 20:08, Bruno Haible wrote:
> But when I think about the thousands of people who use regular expressions
> out there. How would they remember that in parentheses both should be
> backslash-escaped in EREs
>   \( \)
It's even weirder, in that POSIX says unmatched ')' is treated like '\)'
in an ERE, which is why gnulib/lib/dfa.c does not warn about it.
> but brackets and braces are asymmetric
>   \[ ]
>   \{ }
Thanks, good catch about \}. We should treat it like \]. (And this means
regex-quote.c is buggy in a different way, sigh....)
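
Whether a given platform's matcher accepts \] and \} as literals can be
probed directly. The following is only a minimal sketch, assuming a POSIX
<regex.h>; the helper name sketch_ere_matches is hypothetical and is not
part of Gnulib or Grep.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a Gnulib API): return 1 if the local
   regcomp/regexec match TEXT against the ERE PATTERN, 0 if they do
   not, and -1 if PATTERN fails to compile on this platform.  */
static int
sketch_ere_matches (const char *pattern, const char *text)
{
  regex_t re;

  /* REG_NOSUB: only a yes/no answer is needed, not match offsets.  */
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  int rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0 ? 1 : 0;
}
```

Calling sketch_ere_matches ("^a\\]$", "a]") and
sketch_ere_matches ("^a\\}$", "a}") on the platform of interest shows how
its regcomp handles the two escapes; since POSIX leaves their
interpretation undefined, the result is a property of that platform, not
a guarantee.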
> Even if the warning message you install in grep has 3 or 5 lines and goes
> into all details, we are not serving the community if we force them to adhere
> to asymmetric rules, where up to now they could use symmetric rules.
Thanks, and I see Jim agrees too. I installed the first two attached
patches into Gnulib to do that and to fix regex-quote, propagated this
into Grep, and installed the last attached patch to Grep to document this.
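
To illustrate the effect of the regex-quote fix, here is a rough sketch of
ERE quoting using the special-character set from the patched ere_special.
It is a simplified stand-in, not the real lib/regex-quote.c code: the names
sketch_ere_special and sketch_ere_quote are hypothetical, and the sketch
ignores multibyte characters and anchoring.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Characters quoted in an ERE, copied from the patched ere_special.
   Note that ']' and '}' are deliberately absent.  */
static const char sketch_ere_special[] = "$^.*[\\+?{()|";

/* Hypothetical helper (not a Gnulib API): return a malloc'd copy of
   SRC with a backslash before each ERE-special byte, or NULL if
   allocation fails.  The caller frees the result.  */
static char *
sketch_ere_quote (const char *src)
{
  assert (src != NULL);

  /* Worst case: every byte needs a backslash, plus the terminator.  */
  char *dst = malloc (2 * strlen (src) + 1);
  if (!dst)
    return NULL;

  char *p = dst;
  for (; *src; src++)
    {
      if (strchr (sketch_ere_special, *src))
        *p++ = '\\';
      *p++ = *src;
    }
  *p = '\0';
  return dst;
}
```

With this set, sketch_ere_quote ("(foo{$HOME})") yields \(foo\{\$HOME}\),
which is the string the new test_ere case expects: the opening brace is
escaped and the closing brace is left alone.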
At some point the behavior of \], \}, and all the other stuff in the Grep
manual's new "Problematic Expressions" node should be documented in
gnulib/doc/regex.texi too. I'll cc this to Reuben to see whether he has
the time.
It might be useful for GNU grep to have a --pedantic flag to check
regular expression portability, to reject unportable REs like '\]'. But
any such feature can wait until after the next GNU grep release.

From 0153035f93d5e537efef9119676e120034ac912b Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Fri, 3 Jun 2022 18:46:37 -0700
Subject: [PATCH 1/2] dfa: do not warn about \] and \}
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* lib/dfa.c (lex): Do not warn about \] and \}, since they’re
surely universally supported even though POSIX says their
interpretation is undefined.
---
ChangeLog | 7 +++++++
lib/dfa.c | 2 ++
lib/dfa.h | 6 +++++-
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/ChangeLog b/ChangeLog
index 5fe5e9ee23..053fabde2a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2022-06-04 Paul Eggert <egg...@cs.ucla.edu>
+
+ dfa: do not warn about \] and \}
+ * lib/dfa.c (lex): Do not warn about \] and \}, since they’re
+ surely universally supported even though POSIX says their
+ interpretation is undefined.
+
2022-06-03 Paul Eggert <egg...@cs.ucla.edu>
regex-quote: \] -> ] in EREs and BREs
diff --git a/lib/dfa.c b/lib/dfa.c
index bd4c5f0582..4f8367af3f 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -1563,6 +1563,8 @@ lex (struct dfa *dfa)
}
dfawarn (msg);
}
+ FALLTHROUGH;
+ case ']': case '}':
normal_char:
dfa->lex.laststart = false;
/* For multibyte character sets, folding is done in atom. Always
diff --git a/lib/dfa.h b/lib/dfa.h
index 91ec1d809f..043f0e9717 100644
--- a/lib/dfa.h
+++ b/lib/dfa.h
@@ -79,7 +79,11 @@ enum
merely a warning. */
DFA_CONFUSING_BRACKETS_ERROR = 1 << 2,
- /* Warn about stray backslashes before ordinary characters. */
+ /* Warn about stray backslashes before ordinary characters other
+ than ] and } which are special because even though POSIX
+ says \] and \} have undefined interpretation, platforms
reliably ignore those stray backslashes and warning about them
+ would likely cause more trouble than it's worth. */
DFA_STRAY_BACKSLASH_WARN = 1 << 3,
/* Warn about * appearing out of context at the start of an
--
2.34.1
From ac58aead465ab8bea4223060e61c33eb265e8e85 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sat, 4 Jun 2022 09:55:28 -0700
Subject: [PATCH 2/2] regex-quote: \} -> } in EREs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* lib/regex-quote.c (ere_special): Don’t use \} in EREs,
as POSIX says the interpretation is undefined.
* tests/test-regex-quote.c (test_bre, test_ere):
Add tests for }.
---
ChangeLog | 6 ++++++
lib/regex-quote.c | 2 +-
tests/test-regex-quote.c | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/ChangeLog b/ChangeLog
index 053fabde2a..ed21be142f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,11 @@
2022-06-04 Paul Eggert <egg...@cs.ucla.edu>
+ regex-quote: \} -> } in EREs
+ * lib/regex-quote.c (ere_special): Don’t use \} in EREs,
+ as POSIX says the interpretation is undefined.
+ * tests/test-regex-quote.c (test_bre, test_ere):
+ Add tests for }.
+
dfa: do not warn about \] and \}
* lib/dfa.c (lex): Do not warn about \] and \}, since they’re
surely universally supported even though POSIX says their
diff --git a/lib/regex-quote.c b/lib/regex-quote.c
index 9b92e98910..41639ea50e 100644
--- a/lib/regex-quote.c
+++ b/lib/regex-quote.c
@@ -29,7 +29,7 @@
static const char bre_special[] = "$^.*[\\";
/* Characters that are special in an ERE. */
-static const char ere_special[] = "$^.*[\\+?{}()|";
+static const char ere_special[] = "$^.*[\\+?{()|";
struct regex_quote_spec
regex_quote_spec_posix (int cflags, bool anchored)
diff --git a/tests/test-regex-quote.c b/tests/test-regex-quote.c
index 2282d5f662..918ccd5901 100644
--- a/tests/test-regex-quote.c
+++ b/tests/test-regex-quote.c
@@ -79,6 +79,7 @@ test_bre (void)
{
check ("aBc", 0, "aBc");
check ("(foo[$HOME])", 0, "(foo\\[\\$HOME])");
+ check ("(foo{$HOME})", 0, "(foo{\\$HOME})");
}
static void
@@ -86,6 +87,7 @@ test_ere (void)
{
check ("aBc", REG_EXTENDED, "aBc");
check ("(foo[$HOME])", REG_EXTENDED, "\\(foo\\[\\$HOME]\\)");
+ check ("(foo{$HOME})", REG_EXTENDED, "\\(foo\\{\\$HOME}\\)");
}
int
--
2.34.1
From 739892e8d4461a8246fa4c6a0ece18a14ce1e51b Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sat, 4 Jun 2022 10:26:35 -0700
Subject: [PATCH] doc: document \] and \}
* doc/grep.texi (Special Backslash Expressions)
(Problematic Expressions): Document that grep supports
\] and \} as extensions to POSIX.
---
doc/grep.texi | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/doc/grep.texi b/doc/grep.texi
index c34a1ae..9f2f225 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1487,6 +1487,12 @@ Match whitespace, it is a synonym for @samp{[[:space:]]}.
@item \S
Match non-whitespace, it is a synonym for @samp{[^[:space:]]}.
+@item \]
+Match @samp{]}.
+
+@item \@}
+Match @samp{@}}.
+
@end table
For example, @samp{\brat\b} matches the separate word @samp{rat},
@@ -1641,7 +1647,7 @@ portable scripts should avoid them:
@itemize @bullet
@item
-Special backslash expressions like @samp{\<} and @samp{\b}.
+Special backslash expressions like @samp{\b}, @samp{\<}, and @samp{\]}.
@xref{Special Backslash Expressions}.
@item
--
2.34.1