I trust your instincts and powers of analysis here. Maybe MRAB has some useful feedback on the tar in the honey?
On Tue, Nov 28, 2017 at 12:04 PM, Serhiy Storchaka <storch...@gmail.com> wrote:

> The two largest problems in the re module are splitting on zero-width
> patterns and complete and correct support of the Unicode standard. These
> problems are solved in regex. regex has many other features, but they are
> less important.
>
> I want to describe the problem of splitting on zero-width patterns. It
> was already discussed on Python-Dev 13 years ago [3] and maybe later. See
> also issues: [4], [5], [6], [7], [8].
>
> In short, it doesn't work. Splitting on the pattern r'\b' doesn't split the
> text at word boundaries, and splitting on the pattern r'\s+|(?<=-)'
> splits the text on whitespace, but does not split words at hyphens as
> expected.
>
> In Python 3.4 and earlier:
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['Self-Defence Class']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['Self-Defence', 'Class']
>
> Note that splitting on r'\s*' (0 or more whitespace characters) actually
> split on r'\s+' (1 or more whitespace characters). Splitting on patterns
> that can only match the empty string (like r'\b' or r'(?<=-)') never
> worked, while splitting
>
> Since Python 3.5, splitting on a pattern that can only match the empty
> string raises a ValueError (this never worked), and splitting on a pattern
> that can match the empty string, but not only the empty string, emits a
> FutureWarning. This gave developers time to replace their patterns
> r'\s*' with r'\s+', as they should be.
>
> Now I have created a final patch [9] that makes re.split() split on
> zero-width patterns.
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['', 'Self', '-', 'Defence', ' ', 'Class', '']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-', 'Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C', 'l',
> 'a', 's', 's', '']
>
> In the last case the result differs too much from the previous result,
> and this is likely not what the author wanted to get. But users had two
> Python releases to fix their code. FutureWarning is not silenced by
> default.
>
> Because these patterns produced errors or warnings in the two most recent
> releases, we don't need an additional parameter for compatibility.
>
> But the problem was not just with re.split(). Other functions also did
> not work well with patterns that can match the empty string.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'elf', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 4),
> match='elf'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match
> object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<>S<elf>-<Defence> <Class>'
>
> After matching the empty string, the following character is skipped
> and is not included in the next match. My patch fixes these functions
> too.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'Self', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0, 4),
> match='Self'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match
> object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<><Self>-<Defence> <Class>'
>
> I think this change doesn't need preliminary warnings, because it changes
> the behavior of more rarely used patterns. No re tests have been broken.
> I had to add new tests to detect the behavior change.
>
> But there is one spoonful of tar in a barrel of honey. I didn't expect
> this, but this change has broken a pattern used with re.sub() in the
> doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with
> r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think this
> is an avoidable breakage.
>
> The new behavior of re.split() matches the behavior of regex.split() with
> the VERSION1 flag, and the new behavior of re.findall() and re.finditer()
> matches the behavior of the corresponding functions in the regex module
> (independently of the version flag). But the new behavior of re.sub()
> doesn't exactly match the behavior of regex.sub() with any version flag.
> It differs from the old behavior, as you can see in the example above, but
> is closer to it than to regex.sub() with VERSION1. This allowed us to
> avoid breaking existing tests for re.sub().
>
> >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
> >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self::Defence:Class'
> >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
>
> Like re.split(), re.sub() never matches the empty string adjacent to the
> previous match. re.findall() and re.finditer() only refuse to match the
> empty string when it is adjacent to a previous empty match. In the regex
> module regex.sub() is mutually consistent with regex.findall() and
> regex.finditer() (with the VERSION1 flag), but regex.split() is not
> consistent with them. In the re module re.split() and re.sub() will be
> mutually consistent, as well as re.findall() and re.finditer(). This is
> more backward compatible. And I don't know of reasons for preferring the
> behavior of re.findall() and re.finditer() over the behavior of re.split()
> in this corner case.
>
> It would be nice to get this change into 3.7.0a3 for wider testing.
> Please review the patch [9] or share your thoughts about this change.
>
> [1] https://docs.python.org/3/library/re.html
> [2] https://pypi.python.org/pypi/regex/
> [3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
> [4] https://bugs.python.org/issue852532
> [5] https://bugs.python.org/issue988761
> [6] https://bugs.python.org/issue1647489
> [7] https://bugs.python.org/issue3262
> [8] https://bugs.python.org/issue25054
> [9] https://github.com/python/cpython/pull/4471
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org

--
--Guido van Rossum (python.org/~guido)
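
For anyone who wants to check the cases from the quoted message on their own interpreter, here is a minimal sketch. It assumes Python 3.7 or later, where re.split() splits on empty matches; on 3.5 and 3.6 the r'\b' split raises ValueError, as described in the quoted message. The helper name zero_width_demo is made up for illustration and is not part of the patch. Only the cases whose results I am sure of are included.

```python
import re

TEXT = 'Self-Defence Class'


def zero_width_demo(text=TEXT):
    """Run the zero-width-pattern calls from the thread and collect results.

    Hypothetical helper for illustration only; assumes Python 3.7+.
    """
    return {
        # Split at every word boundary (a pattern that only matches '').
        'split_boundary': re.split(r'\b', text),
        # Split on whitespace, or on the empty string right after a hyphen.
        'split_hyphen': re.split(r'\s+|(?<=-)', text),
        # An empty match at the start must not swallow the next character.
        'findall': re.findall(r'^|\w+', text),
        'spans': [m.span() for m in re.finditer(r'^|\w+', text)],
        'sub': re.sub(r'(^|\w+)', r'<\1>', text),
    }


if __name__ == '__main__':
    for name, value in zero_width_demo().items():
        print(f'{name}: {value!r}')
```

Comparing the printed values with the ">>>" sessions in the quoted message shows which behavior a given interpreter implements.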