[Steven D'Aprano <st...@pearwood.info>]
> After this thread, I no longer trust that "easy" regexes will do what
> they "obviously" look like they should do :-(
>
> I'm not trying to be funny or snarky. I *thought* I had a reasonable
> understanding of regexes, and now I have learned that I don't, and that
> the regexes I've been writing don't do what I thought they did, and
> presumedly the only reason they haven't blown up in my face (either
> performance-wise, or the wrong output) is blind luck.

Reading Friedl's book is a cure for the confusion, but not for the angst ;-)

I believe the single most practical addition in recent decades has
been the introduction of "possessive quantifiers". This is a variant of
the "greedy" quantifiers that does what most people at the start
_believe_ they do: one-and-done. After its initial match, backtracking
into it fails. So, e.g., \s++ matches the longest string of whitespace
at the time, period. Why "++"? Regexps ;-) It's essentially gibberish
syntax that previously didn't have a sensible meaning.

For example,

>>> import regex
>>> regex.search("^x+[a-z]{4}k", "xxxxxk")
<regex.Match object; span=(0, 6), match='xxxxxk'>

is what we're used to if we're paying attention: sucking up as many
x's as possible fails to match (there's nothing for [a-z]{4} to match
except the trailing "k"). But we keep backtracking into it, trying to
match one less "x" at a time, until [a-z]{4} finally matches the
rightmost 4 x's.

But make it possessive and the match as a whole fails right away:

>>> regex.search("^x++[a-z]{4}k", "xxxxxk")

++ refuses to give back any of what it matched the first time.
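
For anyone who can't install regex, the same one-and-done behavior can be approximated in plain re. This is a minimal sketch of the well-known lookahead-plus-backreference trick, not something re documents as a possessive quantifier: the lookahead captures the longest run of x's, and \1 then consumes exactly that text. The variable names are mine, for illustration only.

```python
import re

text = "xxxxxk"

# Ordinary greedy x+: gives back x's one at a time until [a-z]{4} can match.
greedy = re.search(r"^x+[a-z]{4}k", text)

# Emulated possessive x++: the lookahead grabs the longest run of x's into
# group 1, and \1 then consumes exactly that text. A lookahead is not
# re-entered once it has succeeded, so nothing is ever given back.
possessive_like = re.search(r"^(?=(x+))\1[a-z]{4}k", text)

print(greedy.span())      # (0, 6)
print(possessive_like)    # None
```

The cost is an extra capture group and a pattern that's even harder to read, which is part of why the dedicated syntax exists.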

At this point there are probably more regexp engines that support this
feature than don't. Python's re does not, but the regex extension
does. Cutting unwanted chances for backtracking greatly cuts the
chance of stumbling into timing disasters.

Where does that leave Python? Pretty much aging itself into
obsolescence. Regexps keep "evolving", but it appears Fredrik lost
interest in keeping up long before he died, and nobody else has
stepped up. regex _has_ kept up, but isn't in the core. So "install
regex" is ever more the best advice.

Note that just slamming possessive quantifiers into CPython's engine
isn't a good approach for more than just the obvious reasons:
possessive quantifiers are themselves just syntax sugar (or chili
peppers) for one instance of a more general new feature, "atomic
groups". That's another feature that's all but a de facto industry
standard now, and that Python's re doesn't support (but regex does).
Putting just part of that in is half-assed.
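
The "sugar" relationship can be sketched like so, with x++ next to its atomic-group spelling (?>x+). Whether stdlib re accepts either spelling depends on the Python version (to my knowledge 3.11 and later do), so the hypothetical helpers try_compile and search_or_unsupported just report "unsupported" when compilation fails. They are illustration only, not part of any library.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if this re rejects the syntax."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

def search_or_unsupported(pattern, text):
    """Search text, or return "unsupported" if the pattern won't compile."""
    compiled = try_compile(pattern)
    return "unsupported" if compiled is None else compiled.search(text)

# x++ and (?>x+) should behave identically: neither gives back any x's,
# so where the syntax is accepted both searches fail.
possessive = search_or_unsupported(r"^x++[a-z]{4}k", "xxxxxk")
atomic = search_or_unsupported(r"^(?>x+)[a-z]{4}k", "xxxxxk")

# Atomic groups are more general: they can wrap any subpattern, not just
# a repeat. Once "a" is taken, the "ab" alternative is never tried.
plain_alt = re.search(r"^(?:a|ab)c", "abc")
atomic_alt = search_or_unsupported(r"^(?>a|ab)c", "abc")

print(possessive, atomic, plain_alt, atomic_alt)
```

The alternation case is the part a possessive quantifier can't express, which is why atomic groups are the more general feature.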


> Now I have *three* problems :-(

You're quite welcome ;-)
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/SRB4XQMSUX5VCEJDTMOESD4E5ROQTAZN/
Code of Conduct: http://python.org/psf/codeofconduct/