[Python-Dev] Re: Changing Python's string search algorithms

Tim Peters Fri, 16 Oct 2020 18:31:48 -0700

I don't plan on making a series of these posts, just this one, to give
people _some_ insight into why the new algorithm gets systematic
benefits the current algorithm can't.  It splits the needle into two
pieces, u and v, very carefully selected by subtle linear-time needle
preprocessing (and it's quite a chore to prove that a suitable
splitting always exists!):

needle == u + v

There are a whole bunch of things that can be said about this
splitting (and must be said to prove worst-case linear time), but
here's just one for which it's easy to show the benefit: no
(non-empty) prefix of v is a suffix of u.

Now I think it's quite easy to twist your head into knots trying to
see what follows from that, so I'll just give a highly generalizable
concrete example. Suppose v starts with "xyza". Then u cannot end
with "x", "xy", "xyz", or "xyza". If u did, then u would have a suffix
that's also a prefix of v.

A match attempt at a given haystack position starts by matching the
characters in v one at a time, left to right. Here with haystack on
the first line, needle on the second, and "0" denoting "don't care /
unknown / doesn't exist" - it's just _some_ glyph so the example has
some chance of lining up under annoying proportional fonts :-(

Say the first character matches:

00000x000000
0uuuuxyzavvv

And the second and third:

00000xyz0000
0uuuuxyzavvv

But the 4th doesn't:

00000xyzZ000
0uuuuxyzavvv

Is there any possibility of a match if we shift the needle one to the
right? No! If we tried, the last character of u would line up with the
"x" in the haystack, and we already know "x" is not a suffix of u.

How about if we shifted two to the right? No again! And for the same
reason: then the last two characters of u would line up with "xy" in
the haystack, but we already know "xy" is not a suffix of u.

And so on. In general, if we find a mismatch while trying the k'th
(1-based) character of v, we can shift the needle k places to the
right before there's any possibility of matching. In this case, k=4,
so we can jump to:

00000xyzA0000000
00000uuuuxyzavvv

Note that no "extra" storage is needed to exploit this. No character
lookups, no extra expenses in time or space of any kind. Just "if we
mismatch on the k'th try, we can jump ahead k positions". Nothing
special about "xyza" either - this works with any characters at all!

"xyza" isn't by any stretch a hard case regardless. But it goes
_some_ way toward hinting at why potentially hard cases are rendered
toothless too. For example, the needle "x" * 100 is split into

u = "" # yup, empty!
v = needle

Now, e.g., if we find a mismatch at the 83'rd character of v, we can
leap ahead 83 positions immediately.

What about needle "x" * 100 + "y"? I picked that to crush your
premature optimism ;-) Now the splitting is

u = "x" * 100
v = "y"

and the wonderful trick above only applies to a trivial 1-character
string. The "hard part" is all in matching the u piece now, which I'll
leave as an exercise for the reader ;-)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/HPMYTQSJ4IXUYUCZE2EYCIF34RTQDW7O/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

Reply via email to