[1003.1(2008)/Issue 7 0001919]: Add \A and \z to regular expressions (at least EREs)

Austin Group Issue Tracker via austin-group-l at The Open Group Sat, 19 Apr 2025 14:27:28 -0700

The following issue has been SUBMITTED. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1919 
====================================================================== 
Reported By:                dwheeler
Assigned To:                ajosey
====================================================================== 
Project:                    1003.1(2008)/Issue 7
Issue ID:                   1919
Category:                   Front Matter
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     Under Review
Name:                       David A. Wheeler 
Organization:                
User Reference:              
Section:                    9. Regular Expressions 
Page Number:                1 
Line Number:                1 
Interp Status:              --- 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2025-04-19 21:01 UTC
Last Modified:              2025-04-19 21:01 UTC
====================================================================== 
Summary:                    Add \A and \z to regular expressions (at least EREs)
Description: 
I propose adding \A and \z to regular expressions (regexes) to reduce the
likelihood of incorrectly copied regular expressions leading to security
vulnerabilities. At least do this for EREs, and preferably both EREs and BREs.


Regexes are widely used, and a common use is to implement security checks:
verifying that all inputs match constrained patterns *before* the inputs are
accepted. The usual way to use regexes for security is to begin the regex with
“anchor at string beginning” (^ is the closest in POSIX) and end it with
“anchor at string end” ($ is the closest in POSIX).

Unfortunately, ecosystems do NOT agree on a standardized way to spell these
anchors, as “^” and “$” do NOT have the same meanings across all
ecosystems. For example, in languages like Perl, Python, PHP, and C#, “$”
also matches just before a newline at the end of the string (“allow an
optional trailing newline as well”). In Ruby, “^” matches at the beginning
of any line (and “$” similarly matches at the end of any line). For guidance
on how to handle this variation between ecosystems, see “Correctly Using
Regular Expressions for Secure Input Validation”
https://best.openssf.org/Correctly-Using-Regular-Expressions
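
As a small illustration of the “$” difference described above, the following
Python sketch compares a “^…$” pattern with one anchored at the string
boundaries. The names loose_ok and strict_ok are hypothetical, chosen only for
this example. Python's re module spells the end-of-string anchor “\Z”; the
“\z” spelling is not accepted by all Python versions, so the sketch uses
“\Z”.

```python
import re

# Hypothetical validators, for illustration only.
LOOSE = re.compile(r"^[a-z]+$")      # "$" also matches just before a final newline
STRICT = re.compile(r"\A[a-z]+\Z")   # "\Z" matches only at the very end of the string


def loose_ok(s: str) -> bool:
    """Validate with ^...$ (accepts an optional trailing newline in Python)."""
    return LOOSE.search(s) is not None


def strict_ok(s: str) -> bool:
    """Validate with string-boundary anchors only."""
    return STRICT.search(s) is not None


print(loose_ok("abc"), strict_ok("abc"))      # True True
print(loose_ok("abc\n"), strict_ok("abc\n"))  # True False
```

The second line of output is the case that matters for input validation: the
“^…$” pattern accepts a value with a trailing newline, while the
string-anchored pattern rejects it.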

This is a serious problem. Davis et al.’s “Why Aren’t Regular Expressions a
Lingua Franca?…” (2019) https://arxiv.org/abs/2105.04397 found that, of
surveyed developers, 94% reuse regexes, 50% reuse regexes at least half the
time, and 47% incorrectly believed that regex notation is the same everywhere.
LLMs will only make this error worse, as they will notice patterns of
“almost” correct results that seem to repeat, and copy this subtle pattern
incorrectly. If humans often make this mistake, systems trained on bad data and
generalizing from it will make it worse.

In addition, if REG_NEWLINE (multi-line) mode is enabled, POSIX currently has no
mechanism to “always anchor at the beginning of the string”. POSIX “^”
and “$” change meaning and there is currently no standard mechanism to
always match only at the beginning or end of a string.

It would be great to “heal the rift” between regex notations for this common
case, so that people could write simple regexes that really WOULD be interpreted
the same way across ecosystems. This healing would also provide missing
functionality. In short, make it possible to ALWAYS use \A for beginning of
string in all cases (even for multi-line matches) and \z for end of string in
all cases (even for multi-line matches). This capability doesn’t exist at all
in POSIX regex (though ^ and $ are similar). This solution was recommended in
“Correctly Using Regular Expressions for Secure Input Validation -
Rationale”
https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale
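
The intended behavior can be observed in an ecosystem that already has these
anchors. This is a minimal Python sketch, assuming that Python's re.MULTILINE
flag is a rough analogue of REG_NEWLINE; it does not use the POSIX regcomp
interface and says nothing about current POSIX behavior. It uses Python's
“\Z” spelling for the end-of-string anchor, and the helper name accepts is
hypothetical.

```python
import re

# Both patterns are compiled in multi-line mode (assumed analogue of REG_NEWLINE).
LINE_ANCHORED = re.compile(r"^[a-z]+$", re.MULTILINE)       # ^ and $ match at line boundaries
STRING_ANCHORED = re.compile(r"\A[a-z]+\Z", re.MULTILINE)   # \A and \Z ignore the flag


def accepts(pattern: re.Pattern, s: str) -> bool:
    """Return True if the pattern matches anywhere in s."""
    return pattern.search(s) is not None


data = "good\nbad value!"

print(accepts(LINE_ANCHORED, data))       # True: the first line alone satisfies ^...$
print(accepts(STRING_ANCHORED, data))     # False: the whole string must match
print(accepts(STRING_ANCHORED, "good"))   # True
```

In multi-line mode the “^…$” pattern accepts the input because one line
matches, while the \A-anchored pattern still requires the entire string to
match, which is the behavior the proposal asks POSIX to make available.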

GNU extends the POSIX regex syntax with this functionality, but it’s spelled
differently:
‘\`’ matches the beginning of the whole input
‘\'’ matches the end of the whole input
This suggests that there is a desire by some to have this functionality.
However, few other systems have copied this syntax. Most other ecosystems use \A
and \z instead. For more see:
https://www.gnu.org/software/findutils/manual/html_node/find_html/posix_002dextended-regular-expression-syntax.html

Note that \A and \z are widely implemented across many ecosystems. It would take
some work to implement in POSIX systems, but this is not expected to be
significant work.


Desired Action: 
Modify base definitions “9. Regular Expressions” in 
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html
as follows:

Modify the ERE text as follows:

In 9.4.8 ERE Precedence, modify “Anchoring” so “^ $” changes to “^ $
\A \z”

In 9.4.9 ERE Expression Anchoring:
* Change “The <circumflex> and <dollar-sign> special characters” to “The
<circumflex> and <dollar-sign> special characters, and the expressions \A and
\z,”
* At the end of points 1 and 2, append “This meaning changes if regcomp is
given the flag REG_NEWLINE as described there.”
* Add: “3. When not inside a bracket expression, a \A shall anchor the
expression or subexpression it begins to the beginning of a string; such an
expression or subexpression can match only a sequence starting at the first
character of a string. This meaning is unchanged by REG_NEWLINE.
4. When not inside a bracket expression, a \z shall anchor the expression or
subexpression it ends to the end of a string; such an expression or
subexpression can match only a sequence ending at the last character of a
string.  This meaning is unchanged by REG_NEWLINE.”

In 9.5 Regular Expression Grammar:
In QUOTED_CHAR add \A and \z

In 9.5.3 ERE Grammar
In ERE_expression after “| ‘$’” add:
| \A
| \z

… and similarly for BRE. I thought I’d start by making proposals for ERE,
and if that seems reasonable, go back for BRE. My understanding is that
there’s more hesitation to change BREs, so while I’d prefer to do this for
both EREs and BREs, I’d rather get 50% than 0%.





Modify regcomp in https://pubs.opengroup.org/onlinepubs/9799919799/
as follows:

Modify “REG_NOTBOL
The first character of the string pointed to by string is not the beginning of
the line. Therefore, the <circumflex> character ('^'), when taken as a special
character, shall not match the beginning of string.” and append “This
behavior is modified by REG_NEWLINE as described below. Similarly \A, when taken
as an anchor, shall not match the beginning of string, and its behavior is not
modified by REG_NEWLINE.”

Modify “REG_NOTEOL
The last character of the string pointed to by string is not the end of the
line. Therefore, the <dollar-sign> ('$'), when taken as a special character,
shall not match the end of string.” and append “This behavior is modified by
REG_NEWLINE as described below. Similarly \z, when taken as an anchor, shall not
match the end of string, and its behavior is not modified by REG_NEWLINE.”

Append to RATIONALE:
“The \A and \z anchors were added to make it easier to reuse regular
expressions between different ecosystems and to provide a mechanism that ALWAYS
means “exactly anchor at the beginning and end of string” even in the
presence of REG_NEWLINE.”

Note: I don't see the PDF, so the page# and line# aren't quite right, but I hope
my section references will make this clear. I welcome corrections on this or
other issues. Thank you.


====================================================================== 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2025-04-19 21:01 dwheeler       New Issue                                    
2025-04-19 21:01 dwheeler       Status                   New => Under Review 
2025-04-19 21:01 dwheeler       Assigned To               => ajosey          
======================================================================
