I don't plan on making a series of these posts, just this one, to give people _some_ insight into why the new algorithm gets systematic benefits the current algorithm can't. It splits the needle into two pieces, u and v, very carefully selected by subtle linear-time needle preprocessing (and it's quite a chore to prove that a suitable splitting always exists!):
needle == u + v There are a whole bunch of things that can be said about this splitting (and must be said to prove worst-case linear time), but here's just one for which it's easy to show the benefit: no (non-empty) prefix of v is a suffix of u. Now I think it's quite easy to twist your head into knots trying to see what follows from that, so I'll just give a highly generalizable concrete example. Suppose v starts with "xyza". Then u cannot end with "x", "xy", "xyz", or "xyza". If u did, then u would have a suffix that's also a prefix of v. A match attempt at a given haystack position starts by matching the characters in v one at a time, left to right. Here with haystack on the first line, needle on the second, and "0" denoting "don't care / unknown / doesn't exist" - it's just _some_ glyph so the example has some chance of lining up under annoying proportional fonts :-( Say the first character matches: 00000x000000 0uuuuxyzavvv And the second and third: 00000xyz0000 0uuuuxyzavvv But the 4th doesn't: 00000xyzZ000 0uuuuxyzavvv Is there any possibility of a match if we shift the needle one to the right? No! If we tried, the last character of u would line up with the "x" in the haystack, and we already know "x" is not a suffix of u. How about if we shifted two to the right? No again! And for the same reason: then the last two characters of u would line up with "xy" in the haystack, but we already know "xy" is not a suffix of u. And so on. In general, if we find a mismatch while trying the k'th (1-based) character of v, we can shift the needle k places to the right before there's any possibility of matching. In this case, k=4, so we can jump to: 00000xyzA0000000 00000uuuuxyzavvv Note that no "extra" storage is needed to exploit this. No character lookups, no extra expenses in time or space of any kind. Just "if we mismatch on the k'th try, we can jump ahead k positions". Nothing special about "xyza" either - this works with any characters at all! "xyza" isn't by any stretch a hard case regardless. But it goes _some_ way toward hinting at why potentially hard cases are rendered toothless too. For example, the needle "x" * 100 is split into u = "" # yup, empty! v = needle Now, e.g., if we find a mismatch at the 83'rd character of v, we can leap ahead 83 positions immediately. What about needle "x" * 100 + "y"? I picked that to crush your premature optimism ;-) Now the splitting is u = "x" * 100 v = "y" and the wonderful trick above only applies to a trivial 1-character string. The "hard part" is all in matching the u piece now, which I'll leave as an exercise for the reader ;-) _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HPMYTQSJ4IXUYUCZE2EYCIF34RTQDW7O/ Code of Conduct: http://python.org/psf/codeofconduct/