Hello everybody,
I have done small comparison of the 3 algorithms: the original zend_memnstr
implementation, KMP, and Boyer-Moore (inefficient, "book" implementation).

The first part - real life data. I have collected the data about the
parameters passed to ZEND_MEMNSTR (not only while it is called from strpos)
on my own server. Data is collected over 1h ~ 2 000 000 calls. (if somebody
thinks how -> "the hard way" just fwrite() added to zend_memnstr to save in
file data about the call)

Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks
- avg length of haystacks  longer than avg: 5685.11

I did several tests - both "human readable" and artificial texts and here
are (some) results:

In case of searching of words in text (piece of book copied into test
bench):
For short haystacks (<1000) the original implementation is actually the best
one.
KMP is slower due to preprocessing time and some overhead in the loop.
BM is 10-30 times slower, thus not really applicable.

In case of artificial texts (like search of params in kind of config - taken
real life example from Joomla)
KMP and original implementation are almost the same (the latter a little bit
faster)
BM is still much slower

some results (times in seconds) - only 5 shown. If somebody wants to play
with it - I can post small program with implementation of all 3 algorithms)

ORIGINAL
-------------------
test 1: 2.450000
test 2: 5.380000
test 3: 3.890000
test 4: 2.870000
test 5: 11.840000

KMP
-------------------
test 1: 2.910000
test 2: 11.370000
test 3: 7.000000
test 4: 3.970000
test 5: 11.660000

BM
-------------------
test 1: 35.030000
test 2: 35.460000
test 3: 8.570000
test 4: 30.500000
test 5: 31.860000

For longer text (<5000) KMP and original zend_memnstr are basically similar,
BM starts to show it is better, the real fun is when haystack is loner than
5000 chars - than suddenly BM starts to be really good.

Some examples for texts longer than 3000 chars

ORIGINAL
-------------------
test 1: 0.110000  <- long text with, strpos = 0
test 2: 97.620000 <- really long text (~10000chars, match ~8000)
test 3:4400, 17.220000 <- match ~4400
test 4:0, 157.990000 <- degraded long text
test 5:0, 99.920000

KMP
-------------------
test 1:0, 0.550000
test 2:8596, 36.670000
test 3:4400, 50.610000
test 4:0, 122.000000
test 5:0, 93.800000

BM
-------------------
test 1:0 28.690000
test 2:8596, 23.230000
test 3:4400, 18.420000
test 4:0 38.890000
test 5:0, 157.920000 <- this one is strange!

As I haven't taken into account the information about haystack/needle KMP
seemed to be good choice. But now think about 63% of calls of ZEND_MEMNSTR
with needle_len = 1!!  I think it is the main place for optimization.
Although strpos implements fix for that, some other functions don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:

if (needle_len = 1)
    here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests
needed to choose good value)
    original implementation (as it is the best one in this case)
else
    BM/KMP (i think BM will be better in this case, as some people
suggested)

Then cleanup all the functions that use zend_memnstr, as they can be this
way simpler.

I did similar test for strripos - in this case KMP is good enough (the
original implementation ALWAYS run in time o(nm) - it is just memcmp at
eachposition no matter if the first chars match, thus is outperformed by KMP
even for short strings). Implementing BM here is not necessary in my opinion
- the match is usually quite close to the end of haystack (average ~80% of
haystack length). But fix for needle_len = 1 should be left as here also
many searches are with short needles.

Do you think it is good idea to rewrite it this way?

Cheers,
Michal

Reply via email to