New submission from James Davis <davis...@vt.edu>:

I have two regexes: /(a|ab)*?b/ and /(ab|a)*?b/.
If I re.search the string "ab" for these regexes, I get inconsistent behavior.
Specifically, /(a|ab)*?b/ matches with capture "a", while /(ab|a)*?b/ matches with an empty capture group.

I am not actually sure which behavior is correct.

Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? once, and the engine proceeds to the b and completes. The capture group ends up containing "a".

Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked with *, the engine repeats the attempt and finds nothing the second time. It proceeds to the b and completes. Because the second match attempt on (ab|a) found nothing, the capture group ends up empty.

The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the non-greedy quantifier.

I cannot see why changing the order of the alternation should have this effect.

The change in behavior occurs in the built-in "re" module but not in the competing "regex" module.
The behavior is consistent in both Python 2.7 and Python 3.5. I have not tested other versions.

I have included the confusing-regex-behavior.py file for troubleshooting.

Below is the behavior for matches on these and many variants.
I find the following lines the most striking:

Regex pattern        matched?             matched string       captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b            True                 ab                   ('',)
(ab|a)+?b            True                 ab                   ('',)
(ab|a){0,}?b         True                 ab                   ('',)
(ab|a){0,2}?b        True                 ab                   ('',)
(ab|a){0,1}?b        True                 ab                   ('a',)
(ab|a)*b             True                 ab                   ('a',)
(ab|a)+b             True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)+?b            True                 ab                   ('a',)

(08:58:48) jamie@woody ~
$ python3 /tmp/confusing-regex-behavior.py

Behavior from re

Regex pattern        matched?             matched string       captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b            True                 ab                   ('',)
(ab|a)+?b            True                 ab                   ('',)
(ab|a){0,}?b         True                 ab                   ('',)
(ab|a){0,2}?b        True                 ab                   ('',)
(ab|a)?b             True                 ab                   ('a',)
(ab|a)??b            True                 ab                   ('a',)
(ab|a)b              True                 ab                   ('a',)
(ab|a){0,1}?b        True                 ab                   ('a',)
(ab|a)*b             True                 ab                   ('a',)
(ab|a)+b             True                 ab                   ('a',)
(a|ab)*b             True                 ab                   ('a',)
(a|ab)+b             True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)+?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(bb|a)*?b            True                 ab                   ('a',)
((?:ab|a)*?)b        True                 ab                   ('a',)
((?:a|ab)*?)b        True                 ab                   ('a',)

Behavior from regex

Regex pattern        matched?             matched string       captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b            True                 ab                   ('a',)
(ab|a)+?b            True                 ab                   ('a',)
(ab|a){0,}?b         True                 ab                   ('a',)
(ab|a){0,2}?b        True                 ab                   ('a',)
(ab|a)?b             True                 ab                   ('a',)
(ab|a)??b            True                 ab                   ('a',)
(ab|a)b              True                 ab                   ('a',)
(ab|a){0,1}?b        True                 ab                   ('a',)
(ab|a)*b             True                 ab                   ('a',)
(ab|a)+b             True                 ab                   ('a',)
(a|ab)*b             True                 ab                   ('a',)
(a|ab)+b             True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)+?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(a|ab)*?b            True                 ab                   ('a',)
(bb|a)*?b            True                 ab                   ('a',)
((?:ab|a)*?)b        True                 ab                   ('a',)
((?:a|ab)*?)b        True                 ab                   ('a',)

----------
components: Regular Expressions
files: confusing-regex-behavior.py
messages: 334560
nosy: davisjam, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: Capture behavior depends on the order of an alternation
type: behavior
versions: Python 2.7, Python 3.5

Added file: https://bugs.python.org/file48085/confusing-regex-behavior.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35859>
_______________________________________
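
For anyone who wants to try the report without downloading the attachment, here is a minimal sketch of a reproduction. It is not the attached confusing-regex-behavior.py, whose contents are not shown in this message. The helper name capture_report is made up for illustration, and only the built-in re module is used. The captured value for (ab|a)*?b is printed rather than hard-coded, since it may depend on the interpreter version you run.

```python
import re


def capture_report(patterns, text):
    """Return one (pattern, matched?, matched string, groups) row per pattern.

    capture_report is a hypothetical helper for this sketch, not part of
    the attached script or of the re module.
    """
    rows = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match is None:
            rows.append((pattern, False, None, None))
        else:
            rows.append((pattern, True, match.group(0), match.groups()))
    return rows


if __name__ == "__main__":
    # The two patterns from the report, plus the greedy variant for contrast.
    patterns = ["(ab|a)*?b", "(a|ab)*?b", "(ab|a)*b"]
    for row in capture_report(patterns, "ab"):
        # str() each cell so booleans and tuples pad like plain text.
        print("{:<20} {:<20} {:<20} {:<20}".format(*map(str, row)))
```

Swapping "import re" for "import regex as re" (assuming the third-party regex package is installed) should give the second set of results for comparison.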