Re: [Python-Dev] Zero-width matching in regexes

MRAB Tue, 05 Dec 2017 16:20:07 -0800

On 2017-12-05 20:26, Terry Reedy wrote:

On 12/4/2017 6:21 PM, MRAB wrote:
I've finally come to a conclusion as to what the "correct" behaviour of zero-width matches should be: """always return the first match, but never a zero-width match that is joined to a previous zero-width match""".
Is this different from current re or regex?

Sometimes yes.

It's difficult to know how a zero-width match should be handled.

The normal way that, say, findall works is that it searches for a match and then continues from where it left off.

If at any point it matched an empty string, it would stall because the end of the match is also the start of the match.
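
To make the stall concrete, here is a minimal sketch of such a "continue from where it left off" loop. The helper name naive_spans and the max_steps cap are illustrative assumptions, not anything in re; the cap is there only so the demonstration terminates.

```python
import re

def naive_spans(pattern, text, max_steps=5):
    """Hypothetical helper: resume each search exactly where the last match ended."""
    compiled = re.compile(pattern)
    spans, pos = [], 0
    for _ in range(max_steps):  # cap added only so the demo terminates
        m = compiled.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        pos = m.end()  # after an empty match, pos does not move
    return spans

print(naive_spans(r'|.', 'a'))
# prints [(0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
```

Without the cap, the loop would keep returning the same empty match at position 0.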


How should that be handled?

The old buggy behaviour of the re module was to just advance by one character after a zero-width match, which can result in a character being skipped and going missing.
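
The skipped character can be reproduced with a small simulation of that "advance by one" rule. This is only a sketch of the rule as described here, not the re module's actual code, and advance_by_one_spans is a made-up name.

```python
import re

def advance_by_one_spans(pattern, text):
    """Hypothetical helper: step one character forward after every empty match."""
    compiled = re.compile(pattern)
    spans, pos = [], 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # Non-empty match: resume at its end. Empty match: skip one character.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return spans

print(advance_by_one_spans(r'|.', 'a'))
# prints [(0, 0), (1, 1)]
```

The 'a' at span (0, 1) is never reported, because the search position jumps over it after the empty match at 0.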

A solution is to prohibit a zero-width match that's joined to the previous match, but I'm not convinced that that's correct.

If it's about to return a zero-width match that's joined to a previous zero-width match, then backtrack and keep on looking for a match.
Example:

 >>> print([m.span() for m in re.finditer(r'|.', 'a')])
[(0, 0), (0, 1), (1, 1)]

re.findall, re.split and re.sub should work accordingly.
If re.finditer finds n matches, then re.split should return a list of n+1 strings and re.sub should make n replacements (excepting maxsplit, etc.).
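
The n / n+1 relationship can be checked with a short script. This sketch assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string; check_counts is a hypothetical helper, and the patterns have no capture groups (re.split would otherwise add the groups to its result).

```python
import re

def check_counts(pattern, text):
    """Hypothetical helper: count matches, split pieces and replacements."""
    n = sum(1 for _ in re.finditer(pattern, text))
    pieces = re.split(pattern, text)
    _, replacements = re.subn(pattern, '-', text)
    return n, len(pieces), replacements

for pattern, text in [(r'|.', 'a'), (r'x*', 'abxd')]:
    n, n_pieces, n_repl = check_counts(pattern, text)
    # Expect n_pieces == n + 1 and n_repl == n if the three functions agree.
    print(pattern, text, n, n_pieces, n_repl)
```

On an interpreter where finditer, split and sub follow the same rule, the second number printed is one more than the first and the third equals the first.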

