Stephen, I'm afraid, I don't know any publications on UTF-8 Boyer-Moore string search, but I also haven't done a proper literature search on this yet. A couple of regular expression libaries have been published for UTF-8, but I haven't looked at these yet, and I don't know, how much effort has already flown into optimizing these.
Is a byte-wise Boyer-Moore on UTF-8 data really that bad? I understand that with a high fraction of non-ASCII characters, some initial byte values would be very common, leading to short jumps only, but the continuation bytes should have an entropy comparable to that of ASCII characters and should cause the search to progress equally fast, at least as long as you don't insist on case-insensitive matching, which is probabaly best done after UTF-8 decoding, defeating much of the speed of Boyer-Moore and similar string search algorithms. Might be an interesting research challenge ... Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/> -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
