On 6 December 2017 at 13:13, Serhiy Storchaka <storch...@gmail.com> wrote:
> 05.12.17 22:26, Terry Reedy wrote:
>>
>> On 12/4/2017 6:21 PM, MRAB wrote:
>>>
>>> I've finally come to a conclusion as to what the "correct" behaviour of
>>> zero-width matches should be: """always return the first match, but never a
>>> zero-width match that is joined to a previous zero-width match""".
>>
>>
>> Is this different from current re or regex?
>
>
> Partially. There are different ways of handling the problem of repeated
> zero-width searching.
>
> 1. The one formulated by Matthew. This is the behavior of findall() and
> finditer() in regex in both VERSION0 and VERSION1 modes, sub() in regex in
> the VERSION1 mode, and findall() and finditer() in re since 3.7.
>
> 2. Prohibit a zero-width match that is joined to a previous match
> (independent of its width). This is the behavior of sub() in re and in
> regex in the VERSION0 mode, and split() in regex in the VERSION1 mode. This
> is the only correctly documented and explicitly tested behavior in re.
>
> 3. Prohibit a zero-width match (always). This is the behavior of split() in
> re in 3.4 and older (deprecated since 3.5) and in regex in VERSION0 mode.
>
> 4. Exclude the character following a zero-width match from following
> matches. This is the behavior of findall() and finditer() in 3.6 and older.
>
> Case 4 is definitely incorrect. It leads to excluding characters from
> matching. re.findall(r'^|\w+', 'two words') returns ['', 'wo', 'words'].
>
> Case 3 is pretty useless. It disallows splitting on useful zero-width
> patterns like `\b` and makes `\s*` just equal to `\s+`.
>
> The difference between cases 1 and 2 is subtle. Case 1 looks more
> logical and matches the behavior of Perl and PCRE, but case 2 is
> explicitly documented and tested. This behavior is kept for compatibility
> with an ancient re implementation.
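
To make the difference between cases 1 and 2 concrete for sub(), here is a
minimal sketch that applies either rule by hand. It is not how re or regex
implement it: sub_sketch and its rule argument are hypothetical names made up
for illustration, and the sketch assumes that when an empty match is
prohibited, no non-empty match exists at the same position (true for greedy
patterns such as x* or \w*, not for something like ^|\w+). It only uses
match() at explicit positions, so the result should not depend on which
Python version runs it.

```python
import re


def sub_sketch(pattern, repl, string, rule):
    # Hypothetical helper: replace matches under zero-width rule 1 or 2.
    # Rule 1: prohibit an empty match joined to a previous *empty* match.
    # Rule 2: prohibit an empty match joined to *any* previous match.
    # Simplification (assumption): after a prohibited empty match the scan
    # just moves on one character instead of retrying for a non-empty
    # match at the same position.
    compiled = re.compile(pattern)
    out = []
    pos = 0
    prev = None  # (start, end) of the last accepted match
    while pos <= len(string):
        m = compiled.match(string, pos)
        if m is not None:
            empty = m.start() == m.end()
            joined = prev is not None and prev[1] == m.start()
            prev_empty = prev is not None and prev[0] == prev[1]
            if rule == 1:
                prohibited = empty and joined and prev_empty
            else:
                prohibited = empty and joined
            if not prohibited:
                out.append(repl)
                prev = m.span()
                pos = m.end()
                continue
        # No usable match here: copy one character and move on.
        out.append(string[pos:pos + 1])
        pos += 1
    return "".join(out)


print(sub_sketch(r'x*', '-', 'abxd', rule=1))  # -a-b--d-
print(sub_sketch(r'x*', '-', 'abxd', rule=2))  # -a-b-d-
```

The only place the two rules disagree is directly after a non-empty match:
rule 1 accepts the empty match that follows the 'x', rule 2 rejects it.
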

Behaviour (1) means that we'd get

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION1)
'xx xx'

(because \w* matches the empty string after each word, as well as each
word itself). I just tested in Perl, and that is indeed what happens there
as well.

On that basis, I have to say that I find behaviour (2) more intuitive and
(arguably) "correct":

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION0)
'x x'
>>> re.sub(r'\w*', 'x', 'hello world')
'x x'

Paul
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev