[Guido] > I am not able to dream up any hard cases -- like other posters, > my own use of substring search is usually looking for a short > string in a relatively short piece of text. I doubt even the current > optimizations matter to my uses.
I should have responded to this part differently. What I said was fine ;-) , but it's a mistake to focus exclusively on pathologies here. What Fredrik did was with the aim of significantly speeding utterly ordinary searches. For example, search for "xyz" in "abcdefgh...xyz". Brute force starts comparing at the first location: abcdefgh... xyz The current code compares "c" with "z" fist. They don't match. Now what? It looks at the_next_ character in the haystack, "d". Thanks to preprocessing the needle, it knows that "d" appears nowhere in the needle (actually, the code is so keen on "tiny constant extra space" that preprocessing only saves enough info to get a _probabilitistic_ guess about whether "d" is in the needle, but one that's always correct when "not in the needle" is its guess). For that reason, there's no possible match when starting the attempt at _any_ position that leaves "d" overlapping with the needle. So it can immediately skip even trying starting at text indices 1, 2, or 3. It can jump immediately to try next at index 4: abcdefgh... 0123xyz Then the same kind of thing, seeing that "g" and "z" don't match, and that "h" isn't in the needle. And so on, jumping 4 positions at a time until finally hitting "xyz" at the end of the haystack. The new code gets similar kinds of benefits in _many_ similarly ordinary searches too, but they're harder to explain because they're not based on intuitive tricks explicitly designed to speed ordinary cases. They're more happy consequences of making pathological cases impossible. >From randomized tests so far, it's already clear that the new code is finding more of this nature to exploit in non-pathological cases than the current code. Although that's partly (but only partly) a consequence of Dennis augmenting the new algorithm with a more powerful version of the specific trick explained above (which is an extreme simplification of Daniel Sunday's algorithm, which was also aimed at speeding ordinary searches). _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Y3DFFBHNMHDGRE2GIEMH7XLY5YR6BMKR/ Code of Conduct: http://python.org/psf/codeofconduct/