[Python-Dev] Re: Changing Python's string search algorithms

David Mertz Tue, 13 Oct 2020 21:27:04 -0700

I'm sure that the large majority of the string searches I've done are in
Larry's tiny category.


However, I also think that big data needs are increasing, and things like
FASTA files can be enormously large texts that one has good reasons to
search on.

If there is a heuristic switch between algorithms, it seems like it should
threshold on needle, not on haystack. Or possibly some more complex
function of both. But if I understand the two-way approach correctly (which
I probably don't), there's not really much gain for small needle.

On Wed, Oct 14, 2020, 12:09 AM Larry Hastings <[email protected]> wrote:

>
> On 10/13/20 9:54 AM, Tim Peters wrote:
>
> Due to the natures of the old and new algorithms, neither will be
> faster or slower in all cases, the newer one should never be
> dramatically slower than the old one, the new one will be
> spectacularly faster in some cases, and "in general" it seems
> impossible to guess in advance. It depends on the value the fancier
> preprocessing can extract from the pattern versus the higher
> preprocessing cost, and that in turn depends on details of exactly
> what the patterns and texts to be searched are.
>
>
> My off-the-top-of-my-head reaction: I've always assumed that most string
> searching is tiny.  Like, searching for a < 10 character string in a < 256
> character string.  That's what I'm always doing, at least.  So while I'd
> obviously like to see us address the pathological cases cited--and thanks
> to Dennis Sweeney for the PRs!--I'd hope that it doesn't make these small
> searches any slower.  Your email didn't give me a sense of how much
> additional preprocessing these new algorithms require; the fact that
> they're O(1) space suggests they aren't bad.  But if you do lots and lots
> of dinky searches surely it would add up.
>
> Looking at the PR, I see there's a special case for searching for a single
> character, and then cases for forward-search and reverse-search.  I was
> wondering if I'd find a heuristic like "if the string you're searching in
> is < 2k, use the old search algorithm".  Is that worth pursuing?  It
> doesn't seem like the maintenance cost would be that high; I don't think
> anybody has touched that code in years.  (No surprise, the Effbot did a
> good job.)
>
>
> */arry*
> _______________________________________________
> Python-Dev mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/[email protected]/message/D5GYO2FVUW762RGZ5NMYKYM7PPWWRDIS/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

_______________________________________________
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/4IEDAS5QAHF53IV5G3MRWPQAYBIOCWJ5/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

Reply via email to