Stephen,

I'm afraid, I don't know any publications on UTF-8 Boyer-Moore string
search, but I also haven't done a proper literature search on this yet.
A couple of regular expression libaries have been published for UTF-8,
but I haven't looked at these yet, and I don't know, how much effort has
already flown into optimizing these.

Is a byte-wise Boyer-Moore on UTF-8 data really that bad? I understand
that with a high fraction of non-ASCII characters, some initial byte
values would be very common, leading to short jumps only, but the
continuation bytes should have an entropy comparable to that of ASCII
characters and should cause the search to progress equally fast, at
least as long as you don't insist on case-insensitive matching, which is
probabaly best done after UTF-8 decoding, defeating much of the speed of
Boyer-Moore and similar string search algorithms.

Might be an interesting research challenge ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/

Reply via email to