Re: [Python-Dev] Zero-width matching in regexes

Serhiy Storchaka Wed, 06 Dec 2017 05:16:55 -0800

05.12.17 22:26, Terry Reedy wrote:

On 12/4/2017 6:21 PM, MRAB wrote:
I've finally come to a conclusion as to what the "correct" behaviour of zero-width matches should be: """always return the first match, but never a zero-width match that is joined to a previous zero-width match""".
Is this different from current re or regex?

Partially. There are different ways of handling the problem of repeated zero-width searching.

1. The one formulated by Matthew. This is the behavior of findall() and finditer() in regex in both VERSION0 and VERSION1 modes, sub() in regex in the VERSION1 mode, and findall() and finditer() in re since 3.7.

2. Prohibit a zero-width match that is joined to a previous match (independent of its width). This is the behavior of sub() in re and in regex in the VERSION0 mode, and split() in regex in the VERSION1 mode. This is the only correctly documented and explicitly tested behavior in re.

3. Prohibit a zero-width match (always). This is the behavior of split() in re in 3.4 and older (deprecated since 3.5) and in regex in VERSION0 mode.

4. Exclude the character following a zero-width match from following matches. This is the behavior of findall() and finditer() in 3.6 and older.

Case 4 is definitely incorrect. It leads to excluding characters from matching: re.findall(r'^|\w+', 'two words') returns ['', 'wo', 'words'], silently dropping the 't'.
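For comparison, here is a minimal check of the same call, assuming it is run on Python 3.7 or later, where findall() follows rule 1 rather than rule 4. The variable name is only illustrative.

```python
import re

# Assumption: Python 3.7+, where findall() no longer skips the character
# that follows a zero-width match.
words = re.findall(r'^|\w+', 'two words')
print(words)  # ['', 'two', 'words']
```

The empty match of `^` at position 0 is still reported, but the non-empty match 'two' is now allowed to start at that same position.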

Case 3 is pretty useless. It disallows splitting on useful zero-width patterns like `\b` and makes `\s*` just equal to `\s+`.
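A small sketch of what case 3 rules out, assuming Python 3.7 or later, where split() in re accepts zero-width matches. The variable names are only illustrative.

```python
import re

# Assumption: Python 3.7+, where re.split() may split on empty matches.
on_boundaries = re.split(r'\b', 'two words')
print(on_boundaries)  # ['', 'two', ' ', 'words', '']

# When empty matches are allowed, \s* and \s+ no longer give the same result.
star = re.split(r'\s*', 'two words')
plus = re.split(r'\s+', 'two words')
print(plus)  # ['two', 'words']
print(star == plus)  # False
```

Under case 3 the first call could not split at all, and the last two calls would be indistinguishable.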

The difference between cases 1 and 2 is subtle. Case 1 looks more logical and matches the behavior of Perl and PCRE, but case 2 is explicitly documented and tested. The case 2 behavior is kept for compatibility with an ancient re implementation.
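The difference can be made visible with a rough sketch. It assumes Python 3.7 or later, where finditer() follows rule 1, and it emulates rule 2 by filtering those matches. The helpers rule1_spans(), rule2_spans() and substitute() are hypothetical names for illustration only; this is not how re or regex implement either rule.

```python
import re

def rule1_spans(pattern, text):
    # Assumption: on Python 3.7+, finditer() already follows rule 1.
    return [m.span() for m in re.finditer(pattern, text)]

def rule2_spans(pattern, text):
    # Emulate rule 2: drop every empty match that starts exactly where
    # the previously kept match ended, whatever that match's width.
    spans = []
    prev_end = None
    for start, end in rule1_spans(pattern, text):
        if start == end and start == prev_end:
            continue
        spans.append((start, end))
        prev_end = end
    return spans

def substitute(spans, repl, text):
    # Rebuild the string, replacing each listed span with repl.
    out, pos = [], 0
    for start, end in spans:
        out.append(text[pos:start])
        out.append(repl)
        pos = end
    out.append(text[pos:])
    return ''.join(out)

text = 'abxd'
r1 = rule1_spans(r'x*', text)
r2 = rule2_spans(r'x*', text)
print(substitute(r1, '-', text))  # -a-b--d-
print(substitute(r2, '-', text))  # -a-b-d-
```

The only difference is the empty match at position 3, right after the 'x': rule 1 keeps it, rule 2 rejects it because it is joined to the previous (non-empty) match.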


_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
