Re: [up for grabs]Two-way string searching

Dmitry Olshansky Mon, 17 Jun 2013 10:15:44 -0700

17-Jun-2013 15:36, monarch_dodra пишет:

On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:

On 6/16/13 6:16 PM, Dmitry Olshansky wrote:

It have long bugged me that Phobos std.algorithm.find is slow. Far
slower then strstr (esp on *nix where it competes against GLIBC[1]).


The current state of the art in searching substring
with O(1) space requirement is Two-Way algorithm:

http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

which is linear in time.

I could imagine it would be interesting to implement a generic version
as well. Any takers? I'd gladly go for it myself but unfortunately way
too busy on other fronts (I planed to do it for a couple of months
already).

[1] See a relevant bug report and discussion e.g on glibc
http://sourceware.org/bugzilla/show_bug.cgi?id=5514


Awesome idea! Could you please submit an enhancement request for this?
I wonder what kind of ranges that could work on.

Andrei


One of the "problems" is that find is array agnostic, so doesn't know
how to squeeze out all the juice out arrays, such as:
* No bounds check on accesses
* No bounds check on slicing
* vectorized comparisons

I took the existing find(RA, BD) code, and modified it to operate on
find(ARR, ARR).

On average, I'm getting roughly 20%-25% performance improvements
(-release -O -inline), although the result is of course highly dependent
on the tested input.

Goes to say that by addapting the existing algorithm to simply better
exploit arrays, there is already room for good improvements.

Agreed and there are many tricks beyond that. Like making a repeated bitpattern and searching one machine word at a time (if element size <size_t.sizeof).

And this is all nice and have its place too but I'm mostly talking aboutO(N*M) vs O(N).


Anyway, filed:
http://d.puremagic.com/issues/show_bug.cgi?id=10392

Given that string-to-same-width-string boils back down integral array
search, the gains would also be had for strings.

--------
I was also able to squeeze out similar performace boosts for find(R, E),
with minimal code changes, exploiting better iteration semantics based
on the type iterated (range, RA array, or narrow string).

--------
I can start by improving find(R, E), because it is a small but easy and
effective change.

For find(R, R), things are a bit more dicey to properly integrate, so I
don't want to do anything right now.

But the point is that there is still room for substantial improvements
without modifying the algorithm too much...



--
Dmitry Olshansky

Re: [up for grabs]Two-way string searching

Reply via email to