[Python-Dev] Re: Changing Python's string search algorithms

Tim Peters Thu, 15 Oct 2020 07:28:29 -0700

[Guido]
> I am not able to dream up any hard cases -- like other posters,
> my own use of substring search is usually looking for a short
> string in a relatively short piece of text. I doubt even the current
> optimizations matter to my uses.

I should have responded to this part differently. What I said was
fine ;-) , but it's a mistake to focus exclusively on pathologies
here. What Fredrik did was with the aim of significantly speeding
utterly ordinary searches. For example, search for "xyz" in
"abcdefgh...xyz".

Brute force starts comparing at the first location:

abcdefgh...
xyz

The current code compares "c" with "z" fist. They don't match. Now
what? It looks at the_next_ character in the haystack, "d". Thanks
to preprocessing the needle, it knows that "d" appears nowhere in the
needle (actually, the code is so keen on "tiny constant extra space"
that preprocessing only saves enough info to get a _probabilitistic_
guess about whether "d" is in the needle, but one that's always
correct when "not in the needle" is its guess).

For that reason, there's no possible match when starting the attempt
at _any_ position that leaves "d" overlapping with the needle. So it
can immediately skip even trying starting at text indices 1, 2, or 3.
It can jump immediately to try next at index 4:

abcdefgh...
0123xyz

Then the same kind of thing, seeing that "g" and "z" don't match, and
that "h" isn't in the needle. And so on, jumping 4 positions at a
time until finally hitting "xyz" at the end of the haystack.

The new code gets similar kinds of benefits in _many_ similarly
ordinary searches too, but they're harder to explain because they're
not based on intuitive tricks explicitly designed to speed ordinary
cases. They're more happy consequences of making pathological cases
impossible.

>From randomized tests so far, it's already clear that the new code is
finding more of this nature to exploit in non-pathological cases than
the current code. Although that's partly (but only partly) a
consequence of Dennis augmenting the new algorithm with a more
powerful version of the specific trick explained above (which is an
extreme simplification of Daniel Sunday's algorithm, which was also
aimed at speeding ordinary searches).
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/Y3DFFBHNMHDGRE2GIEMH7XLY5YR6BMKR/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

Reply via email to