https://bugs.exim.org/show_bug.cgi?id=2495
Bug ID: 2495
Summary: Captures made in a lookaround inside a loop are not rolled back when backtracking
Product: PCRE
Version: 8.43
Hardware: All
OS: All
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: p...@hermes.cam.ac.uk
Reporter: david...@earthling.net
CC: pcre-dev@exim.org

When a capture group makes captures inside of a lookaround in a looped group, and then the looped group backtracks due to subsequent conditions outside of itself not matching, the content of the capture group is not rolled back to the state it had on that earlier iteration of the loop. This bug is fixed in PCRE2.

Here is a minimal demonstration of this bug, with lookaheads and lookbehinds:

$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?=(.)).)*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:.(?<=(.)))*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:(?=(.)).)*_\1'
a_ab
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:.(?<=(.)))*_\1'
a_ab

The loop initially consumes all characters, ending by consuming and capturing "b". It then fails to match "_", and backtracks until it does. What it should do is keep rolling back the content each time it backtracks a character, from "b" to "a" to "_" to "a", after which the match of "_" succeeds, and \1 should contain "a" (and does, in PCRE2). But in PCRE, it still contains "b", the last thing it contained before backtracking.

I apologize for not reporting this bug earlier.
I found it back in 2014:
https://gist.github.com/Davidebyzero/9090628#gistcomment-1187218

Then teukon described a small example in which the bug causes the wrong result (which can now be seen to have been fixed in PCRE2):
https://gist.github.com/Davidebyzero/9090628#gistcomment-1188164

$ echo 'x'|pcregrep '^(?=((?=(x*))x)+)\2$'
$ echo 'x'|pcre2grep '^(?=((?=(x*))x)+)\2$'
x

It's not merely the atomicity of the lookahead that triggers this bug; it doesn't happen with capturing in an atomic group:

$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?>(.)))*_\1'
a_ab

The only situation in which the capture actually is rolled back is if the loop backtracks all the way to zero iterations (rolling the capture back to being unset):

$ echo 'a' | pcregrep '^(?:(?=(.)).)*^(?(1)(?!))'
a

But if it only backtracks to the first iteration, the bug still happens:

$ (echo 'aab'; echo 'abb') | pcregrep '^(?:(?=(.)).)*(?<=^.)\1'
abb
$ (echo 'aab'; echo 'abb') | pcre2grep '^(?:(?=(.)).)*(?<=^.)\1'
aab

This is an annoying bug, resulting in undesired and unexpected behavior, and I can't think of any useful way to exploit it to do something PCRE could otherwise not do. On the contrary, I think it actually prevents some things in PCRE that would otherwise be possible.
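
For illustration only, the difference between the two behaviors on the first pattern, ^(?:(?=(.)).)*_\1, can be modeled with a small hand-written Python sketch. This is not PCRE's code and does not call any regex engine: the function name match_first_pattern and its rollback flag are hypothetical, and the model assumes a single-line subject containing no newline. With rollback=True, group 1 is restored to the value it had on each earlier iteration (the PCRE2 results above); with rollback=False, it keeps the last value captured before backtracking (the PCRE 8.43 results above).

```python
def match_first_pattern(subject, rollback=True):
    r"""Model ^(?:(?=(.)).)*_\1 on a single-line subject (no newline assumed)."""
    # history[k] is the value of group 1 after k loop iterations;
    # None means the group is unset (zero iterations).
    history = [None] + list(subject)
    last_capture = history[-1]

    # The greedy loop first consumes everything, then gives back one
    # iteration at a time until the tail "_\1" matches at position k.
    for k in range(len(subject), 0, -1):
        group1 = history[k] if rollback else last_capture
        if subject.startswith("_" + group1, k):
            return True
    # At zero iterations group 1 is unset in both models, and a
    # backreference to an unset group fails to match (PCRE's default).
    return False


for line in ("a_ab", "a_bb"):
    print(line,
          match_first_pattern(line, rollback=True),
          match_first_pattern(line, rollback=False))
```

This prints "a_ab True False" and "a_bb False True", mirroring the pcre2grep and pcregrep outputs shown for that pattern.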