The following issue has been SUBMITTED.
======================================================================
https://www.austingroupbugs.net/view.php?id=1919
======================================================================
Reported By:                dwheeler
Assigned To:                ajosey
======================================================================
Project:                    1003.1(2008)/Issue 7
Issue ID:                   1919
Category:                   Front Matter
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     Under Review
Name:                       David A. Wheeler
Organization:
User Reference:
Section:                    9. Regular Expressions
Page Number:                1
Line Number:                1
Interp Status:              ---
Final Accepted Text:
======================================================================
Date Submitted:             2025-04-19 21:01 UTC
Last Modified:              2025-04-19 21:01 UTC
======================================================================
Summary:                    Add \A and \z to regular expressions (at least EREs)

Description:
I propose adding \A and \z to regular expressions (regexes) to reduce the likelihood of incorrectly copied regular expressions leading to security vulnerabilities. At least do this for EREs, and preferably both EREs and BREs.

Regexes are widely used, and a common use is to implement security checks: verifying that all inputs match constrained patterns *before* the inputs are accepted. The usual way to use regexes for security is to begin the regex with “anchor at string beginning” (^ is the closest in POSIX) and end it with “anchor at string end” ($ is the closest in POSIX).

Unfortunately ecosystems do NOT agree on a standardized way to spell these anchors, as “^” and “$” do NOT have the same meanings across all ecosystems. For example, “$” also matches just before an optional trailing newline in languages like Perl, Python, PHP, and C#. The “^” means “begin any line” in Ruby (and similar for “$”). For guidance on how to handle this variation between ecosystems, see “Correctly Using Regular Expressions for Secure Input Validation” https://best.openssf.org/Correctly-Using-Regular-Expressions

This is a serious problem, as shown by Davis et al’s “Why Aren’t Regular Expressions a Lingua Franca?…” (2019) https://arxiv.org/abs/2105.04397. Of surveyed developers, 94% reuse regexes, 50% reuse regexes at least half the time, and 47% incorrectly believed that regex notation is the same everywhere.

LLMs will only make this error worse, as they will notice patterns of “almost” correct results that seem to repeat, and copy this subtle pattern incorrectly. If humans often make this mistake, systems trained on bad data and generalizing it will make it worse.

In addition, if REG_NEWLINE (multi-line) mode is enabled, POSIX “^” and “$” change meaning, and there is currently no standard mechanism to always match only at the beginning or end of a string.

It would be great to “heal the rift” between regex notations for this common case, so that people could write simple regexes that really WOULD be interpreted the same way across ecosystems. This healing would also provide missing functionality.
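
To make the difference concrete, here is a minimal sketch in Python. It uses Python's `re` module, not POSIX regcomp, and the validator pattern and sample inputs are hypothetical. Two assumptions to keep in mind: Python has historically spelled the end-of-string anchor \Z rather than \z, and `re.MULTILINE` is only roughly analogous to REG_NEWLINE.

```python
import re

# Hypothetical validator pattern: "one or more lowercase ASCII letters".
LINE_ANCHORED = r"^[a-z]+$"
STRING_ANCHORED = r"\A[a-z]+\Z"  # Python's traditional end-of-string spelling is \Z


def accepts(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern matches somewhere in text."""
    return re.search(pattern, text, flags) is not None


# In Python, "$" also matches just before a trailing newline,
# so this input is accepted even though it is not purely letters.
trailing_newline = accepts(LINE_ANCHORED, "abc\n")

# In multi-line mode "^" and "$" match at any line boundary,
# so a single well-formed line is enough to pass the check.
multiline_bypass = accepts(LINE_ANCHORED, "rm -rf /\nabc", re.MULTILINE)

# The string anchors reject both inputs, with or without multi-line mode.
strict_trailing = accepts(STRING_ANCHORED, "abc\n")
strict_multiline = accepts(STRING_ANCHORED, "rm -rf /\nabc", re.MULTILINE)

print(trailing_newline, multiline_bypass, strict_trailing, strict_multiline)
```

Under these assumptions the line anchors accept both malformed inputs while the string anchors reject them, which is the behavior this proposal aims to make expressible in POSIX regexes.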

In short, make it possible to ALWAYS use \A for beginning of string in all cases (even for multi-line matches) and \z for end of string in all cases (even for multi-line matches). This capability doesn’t exist at all in POSIX regex (though ^ and $ are similar). This solution was recommended in “Correctly Using Regular Expressions for Secure Input Validation - Rationale” https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale

GNU extends the POSIX regex syntax with this functionality, but it’s spelled differently:

‘\`’ matches the beginning of the whole input
‘\'’ matches the end of the whole input

This suggests that there is a desire by some to have this functionality. However, few other systems have copied this syntax. Most other ecosystems use \A and \z instead. For more see: https://www.gnu.org/software/findutils/manual/html_node/find_html/posix_002dextended-regular-expression-syntax.html

Note that \A and \z are widely implemented across many ecosystems. It would take some work to implement in POSIX systems, but this is not expected to be significant work.

Desired Action:
Modify base definitions “9. Regular Expressions” in https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html as follows:

Modify the ERE text as follows:

In 9.4.8 ERE Precedence, modify “Anchoring” so “^ $” changes to “^ $ \A \z”

In 9.4.9 ERE Expression Anchoring:
* Change “The <circumflex> and <dollar-sign> special characters” to “The <circumflex> and <dollar-sign> special characters, and the expressions \A and \z,”
* At the end of points 1 and 2, append “This meaning changes if regcomp is given the flag REG_NEWLINE as described there.”
* Add:
“3. When not inside a bracket expression, a \A shall anchor the expression or subexpression it begins to the beginning of a string; such an expression or subexpression can match only a sequence starting at the first character of a string. This meaning is unchanged by REG_NEWLINE.
4. When not inside a bracket expression, a \z shall anchor the expression or subexpression it ends to the end of a string; such an expression or subexpression can match only a sequence ending at the last character of a string. This meaning is unchanged by REG_NEWLINE.”

In 9.5 Regular Expression Grammar:
In QUOTED_CHAR add \A and \z

In 9.5.3 ERE Grammar
In ERE_expression after “| ‘$’” add:
| \A
| \z

… and similarly for BRE. I thought I’d start by making proposals for ERE, and if that seems reasonable, go back for BRE. My understanding is that there’s more hesitation to change BREs, so while I’d prefer to do this for both EREs and BREs, I’d rather get 50% than 0%.

Modify regcomp in https://pubs.opengroup.org/onlinepubs/9799919799/ as follows:

Modify “REG_NOTBOL The first character of the string pointed to by string is not the beginning of the line. Therefore, the <circumflex> character ('^'), when taken as a special character, shall not match the beginning of string.” and append “This behavior is modified by REG_NEWLINE as described below. Similarly \A, when taken as an anchor, shall not match the beginning of string, and its behavior is not modified by REG_NEWLINE.”

Modify “REG_NOTEOL The last character of the string pointed to by string is not the end of the line. Therefore, the <dollar-sign> ('$'), when taken as a special character, shall not match the end of string.” and append “This behavior is modified by REG_NEWLINE as described below. Similarly \z, when taken as an anchor, shall not match the end of string, and its behavior is not modified by REG_NEWLINE.”

Append to RATIONALE:
“The \A and \z anchors were added to make it easier to reuse regular expressions between different ecosystems and to provide a mechanism that ALWAYS means “exactly anchor at the beginning and end of string” even in the presence of REG_NEWLINE.”

Note: I don't see the PDF, so the page# and line# aren't quite right, but I hope my section references will make this clear.
I welcome corrections on this or other issues. Thank you.
======================================================================

Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-04-19 21:01 dwheeler       New Issue
2025-04-19 21:01 dwheeler       Status                   New => Under Review
2025-04-19 21:01 dwheeler       Assigned To               => ajosey
======================================================================