yesterday i posted the below, but it seems not to have gone through the system. i have just registered, and maybe the post was rejected -- just in case, i resend it here, with further examples.
-------- Original Message --------
Subject: [^\]] in basic regexes
Date: Fri, 13 Feb 2009 16:10:47 +0100
From: Wacek Kusnierczyk <[email protected]>
To: [email protected]

hello,

i observe a behaviour of grep that i am not sure is correct, possibly due to my misunderstanding. i've recently reviewed code written in some language where the intent was to match a sequence of any number of non-']' characters. the matching was done with an underlying regex library, and i have tried the pattern directly with grep.

with grep, the pattern '[^]]' matches one non-] character:

    grep '[^]]' <<< '[\]'
    # match

however, in that code the pattern was '[^\]]*' (with the idea that the character ']' is a metacharacter and therefore must be escaped). according to the docs i know, it is not necessary to escape ']' within a character class when it's the first character there (as in '[]]'), since it then is not considered meta; but it shouldn't be harmful. it happens that this pattern won't do:

    grep '[^\]]' <<< '[\]'
    # no match

this seems strange; i'd read the pattern as 'one character that is not ]'. clearly, the data has two such characters. alternatively, the pattern could be read as 'one character that is neither \ nor ]', but this would require the backslash to be treated as a regular character (not a meta):

    grep '[\]' <<< '[\]'
    # match
    grep '[^\]' <<< '[\]'
    # match
    grep '[^\[]' <<< '[\]'
    # match

in fact, the third above has one possible match, so the pattern is read as 'one non-\ non-[' rather than as 'one non-[':

    grep -o '[^\[]' <<< '[\]'
    # ]

so the 'one non-\ non-]' reading of '[^\]]' is not implausible; then, there would be one match, but there is none. it actually appears that the pattern is read as 'one non-\ followed by one ]':

    grep -o '[^\]]' <<< '[]'
    # []

that is, the first ] is not escaped (coherently with the case of '[^\[]') but rather closes the character class, and the second (unescaped!) ] does not close any class, but is taken literally! (should this not be an invalid regex, with an unmatched class-closing bracket?)

i haven't looked at the sources of grep, so these are plain guesses, but is the behaviour of grep with '[^\]]' correct and intended, or is it a bug?

    grep -V
    # GNU grep 2.5.3

regards,
wacek

ps. here are some further experiments, which seem to indicate that grep gets confused with some combinations of [, ], ^, and \.

# [[] should match one opening bracket
grep -o '[[]' <<< '[^\]'
# [
# OK

# []] should match one closing bracket
grep -o '[]]' <<< '[^\]'
# ]
# OK

# [][] should match one bracket
grep -o '[][]' <<< '[^\]'
# [
# ]
# OK

# [[]] should match one bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[]]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error

# [\] should match one backslash
grep -o '[\]' <<< '[^\]'
# \
# OK

# [\[] should match one backslash or opening bracket
grep -o '[\[]' <<< '[^\]'
# [
# \
# OK

# [\]] should match one backslash or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [[^] should match one opening bracket or caret
grep -o '[[^]' <<< '[^\]'
# [
# ^
# OK

# [[^\] should match one opening bracket, caret, or backslash
grep -o '[[^\]' <<< '[^\]'
# [
# ^
# \
# OK

# [[^\]] should match one opening bracket, caret, backslash, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[^\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [\ ]] should match one backslash, space, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [\ ] ] should match one backslash, space, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\ ] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters

# [\] ] should match one backslash, closing bracket, or space
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters

# [^] should report invalid regex (void ^ or unmatched [)
grep -o '[^]' <<< '[^\]'
# grep: Unmatched [ or [^
# OK

# [^]\] should match one character that is neither a closing bracket nor a backslash
# alternatively, report invalid regex (void ^)
grep -o '[^]\]' <<< '[^\]'
# [
# ^
# WRONG (?) -- matches *two* characters, seemingly inappropriately

--
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: [email protected]
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060
-------------------------------------------------------------------------------
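
The question above turns on how a bracket expression is parsed. POSIX specifies that a backslash has no special meaning inside a bracket expression, and that ] is literal there only when it comes first (after an optional ^). Under that reading, '[^\]]' would be the class [^\] followed by a literal ]. The following is a minimal sketch for rerunning the central cases side by side, not a definitive test suite. The helper name `matches` is hypothetical (introduced here for illustration), and `grep -o` is assumed to be available; it is an extension rather than part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): print every match of a basic
# regex in a string, joined by single spaces on one line.
#   $1 = pattern, $2 = input text
matches() {
    printf '%s\n' "$2" | grep -o -e "$1" | paste -s -d ' ' -
}

# ] placed first after ^ is literal: one character that is not ]
matches '[^]]' '[\]'     # prints: [ \  (two one-character matches)

# backslash is an ordinary character inside the brackets
matches '[^\]' '[\]'     # prints: [ ]  (two one-character matches)

# class [^\] followed by a literal ]: one two-character match
matches '[^\]]' '[]'     # prints: []

# no non-backslash character is directly followed by ] here
matches '[^\]]' '[\]'    # prints nothing (at most an empty line)
```

If the intent is "any run of characters other than ]", the unescaped form '[^]]*' is the one that expresses it under this reading.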
