CVSROOT: /webcvs/grep Module name: grep Changes by: Jim Meyering <meyering> 22/09/03 15:33:15
Index: html_node/Problematic-Expressions.html =================================================================== RCS file: html_node/Problematic-Expressions.html diff -N html_node/Problematic-Expressions.html --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ html_node/Problematic-Expressions.html 3 Sep 2022 19:33:14 -0000 1.1 @@ -0,0 +1,197 @@ +<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> +<html> +<!-- Created by GNU Texinfo 6.8, https://www.gnu.org/software/texinfo/ --> +<head> +<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> +<!-- This manual is for grep, a pattern matching engine. + +Copyright (C) 1999-2002, 2005, 2008-2022 Free Software Foundation, +Inc. + +Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with no +Invariant Sections, with no Front-Cover Texts, and with no Back-Cover +Texts. A copy of the license is included in the section entitled +"GNU Free Documentation License". 
--> +<title>Problematic Expressions (GNU Grep 3.8)</title> + +<meta name="description" content="Problematic Expressions (GNU Grep 3.8)"> +<meta name="keywords" content="Problematic Expressions (GNU Grep 3.8)"> +<meta name="resource-type" content="document"> +<meta name="distribution" content="global"> +<meta name="Generator" content="makeinfo"> +<meta name="viewport" content="width=device-width,initial-scale=1"> + +<link href="index.html" rel="start" title="Top"> +<link href="Index.html" rel="index" title="Index"> +<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents"> +<link href="Regular-Expressions.html" rel="up" title="Regular Expressions"> +<link href="Character-Encoding.html" rel="next" title="Character Encoding"> +<link href="Basic-vs-Extended.html" rel="prev" title="Basic vs Extended"> +<style type="text/css"> +<!-- +a.copiable-anchor {visibility: hidden; text-decoration: none; line-height: 0em} +a.summary-letter {text-decoration: none} +blockquote.indentedblock {margin-right: 0em} +div.display {margin-left: 3.2em} +div.example {margin-left: 3.2em} +kbd {font-style: oblique} +pre.display {font-family: inherit} +pre.format {font-family: inherit} +pre.menu-comment {font-family: serif} +pre.menu-preformatted {font-family: serif} +span.nolinebreak {white-space: nowrap} +span.roman {font-family: initial; font-weight: normal} +span.sansserif {font-family: sans-serif; font-weight: normal} +span:hover a.copiable-anchor {visibility: visible} +ul.no-bullet {list-style: none} +--> +</style> +<link rel="stylesheet" type="text/css" href="https://www.gnu.org/software/gnulib/manual.css"> + + +</head> + +<body lang="en"> +<div class="section" id="Problematic-Expressions"> +<div class="header"> +<p> +Next: <a href="Character-Encoding.html" accesskey="n" rel="next">Character Encoding</a>, Previous: <a href="Basic-vs-Extended.html" accesskey="p" rel="prev">Basic vs Extended Regular Expressions</a>, Up: <a href="Regular-Expressions.html" accesskey="u" 
rel="up">Regular Expressions</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p> +</div> +<hr> +<span id="Problematic-Regular-Expressions"></span><h3 class="section">3.7 Problematic Regular Expressions</h3> + +<span id="index-invalid-regular-expressions"></span> +<span id="index-unspecified-behavior-in-regular-expressions"></span> +<p>Some strings are <em>invalid regular expressions</em> and cause +<code>grep</code> to issue a diagnostic and fail. For example, ‘<samp>xy\1</samp>’ +is invalid because there is no parenthesized subexpression for the +back-reference ‘<samp>\1</samp>’ to refer to. +</p> +<p>Also, some regular expressions have <em>unspecified behavior</em> and +should be avoided even if <code>grep</code> does not currently diagnose +them. For example, ‘<samp>xy\0</samp>’ has unspecified behavior because +‘<samp>0</samp>’ is not a special character and ‘<samp>\0</samp>’ is not a special +backslash expression (see <a href="Special-Backslash-Expressions.html">Special Backslash Expressions</a>). +Unspecified behavior can be particularly problematic because the set +of matched strings might be only partially specified, or not be +specified at all, or the expression might even be invalid. +</p> +<p>The following regular expression constructs are invalid on all +platforms conforming to POSIX, so portable scripts can assume that +<code>grep</code> rejects these constructs: +</p> +<ul> +<li> A basic regular expression containing a back-reference ‘<samp>\<var>n</var></samp>’ +preceded by fewer than <var>n</var> closing parentheses. For example, +‘<samp>\(a\)\2</samp>’ is invalid. + +</li><li> A bracket expression containing ‘<samp>[:</samp>’ that does not start a +character class; and similarly for ‘<samp>[=</samp>’ and ‘<samp>[.</samp>’. For +example, ‘<samp>[a[:b]</samp>’ and ‘<samp>[a[:ouch:]b]</samp>’ are invalid. 
+</li></ul> + +<p>GNU <code>grep</code> treats the following constructs as invalid. +However, other <code>grep</code> implementations might allow them, so +portable scripts should not rely on their being invalid: +</p> +<ul> +<li> Unescaped ‘<samp>\</samp>’ at the end of a regular expression. + +</li><li> Unescaped ‘<samp>[</samp>’ that does not start a bracket expression. + +</li><li> A ‘<samp>\{</samp>’ in a basic regular expression that does not start an +interval expression. + +</li><li> A basic regular expression with unbalanced ‘<samp>\(</samp>’ or ‘<samp>\)</samp>’, +or an extended regular expression with unbalanced ‘<samp>(</samp>’. + +</li><li> In the POSIX locale, a range expression like ‘<samp>z-a</samp>’ that +represents zero elements. A non-GNU <code>grep</code> might treat it as +a valid range that never matches. + +</li><li> An interval expression with a repetition count greater than 32767. +(The portable POSIX limit is 255, and even interval expressions with +smaller counts can be impractically slow on all known implementations.) + +</li><li> A bracket expression that contains at least three elements, the first +and last of which are both ‘<samp>:</samp>’, or both ‘<samp>.</samp>’, or both +‘<samp>=</samp>’. For example, a non-GNU <code>grep</code> might treat +‘<samp>[:alpha:]</samp>’ like ‘<samp>[[:alpha:]]</samp>’, or like ‘<samp>[:ahlp]</samp>’. +</li></ul> + +<p>The following constructs have well-defined behavior in GNU +<code>grep</code>. However, they have unspecified behavior elsewhere, so +portable scripts should avoid them: +</p> +<ul> +<li> Special backslash expressions like ‘<samp>\b</samp>’, ‘<samp>\&lt;</samp>’, and ‘<samp>\]</samp>’. +See <a href="Special-Backslash-Expressions.html">Special Backslash Expressions</a>. + +</li><li> A basic regular expression that uses ‘<samp>\?</samp>’, ‘<samp>\+</samp>’, or ‘<samp>\|</samp>’. + +</li><li> An extended regular expression that uses back-references. 
+ +</li><li> An empty regular expression, subexpression, or alternative. For +example, ‘<samp>(a|bc|)</samp>’ is not portable; a portable equivalent is +‘<samp>(a|bc)?</samp>’. + +</li><li> In a basic regular expression, an anchoring ‘<samp>^</samp>’ that appears +directly after ‘<samp>\(</samp>’, or an anchoring ‘<samp>$</samp>’ that appears +directly before ‘<samp>\)</samp>’. + +</li><li> In a basic regular expression, a repetition operator that +directly follows another repetition operator. + +</li><li> In an extended regular expression, unescaped ‘<samp>{</samp>’ +that does not begin a valid interval expression. +GNU <code>grep</code> treats the ‘<samp>{</samp>’ as an ordinary character. + +</li><li> A null character or an encoding error in either pattern or input data. +See <a href="Character-Encoding.html">Character Encoding</a>. + +</li><li> An input file that ends in a non-newline character, +where GNU <code>grep</code> silently supplies a newline. +</li></ul> + +<p>The following constructs have unspecified behavior, in both GNU +and other <code>grep</code> implementations. Scripts should avoid +them whenever possible. +</p> +<ul> +<li> A backslash escaping an ordinary character, unless it is a +back-reference like ‘<samp>\1</samp>’ or a special backslash expression like +‘<samp>\&lt;</samp>’ or ‘<samp>\b</samp>’. See <a href="Special-Backslash-Expressions.html">Special Backslash Expressions</a>. For +example, ‘<samp>\x</samp>’ has unspecified behavior now, and a future version +of <code>grep</code> might specify ‘<samp>\x</samp>’ to have a new behavior. + +</li><li> A repetition operator that appears directly after an anchor, or at the +start of a complete regular expression, parenthesized subexpression, +or alternative. For example, ‘<samp>+|^*(+a|?-b)</samp>’ has unspecified +behavior, whereas ‘<samp>\+|^\*(\+a|\?-b)</samp>’ is portable. + +</li><li> A range expression outside the POSIX locale. 
For example, in some +locales ‘<samp>[a-z]</samp>’ might match some characters that are not +lowercase letters, or might not match some lowercase letters, or might +be invalid. With GNU <code>grep</code> it is not documented whether +these range expressions use native code points, or use the collating +sequence specified by the <code>LC_COLLATE</code> category, or have some +other interpretation. Outside the POSIX locale, it is portable to use +‘<samp>[[:lower:]]</samp>’ to match a lower-case letter, or +‘<samp>[abcdefghijklmnopqrstuvwxyz]</samp>’ to match an ASCII lower-case +letter. + +</li></ul> + +</div> +<hr> +<div class="header"> +<p> +Next: <a href="Character-Encoding.html">Character Encoding</a>, Previous: <a href="Basic-vs-Extended.html">Basic vs Extended Regular Expressions</a>, Up: <a href="Regular-Expressions.html">Regular Expressions</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p> +</div> + + + +</body> +</html>
