as long as you don't insist on case-insensitive matching, which is probabaly best done after UTF-8 decoding, defeating much of the speed of Boyer-Moore and similar string search algorithms.
Exactly the problem. You're correct, that Boyer-Moore would be quite acceptably fast were it not for case-insensitive matching. Converting to Unicode then applying Boyer-Moore is still substantially faster brute-force string matching.
I can't shake the suspicion that if I organized the data by indexing, I could gain a dramatic increase in speed. That way, the cost of decoding the UTF-8 would be born by the building of the index, not the search itself.
As a non-academic, I'm always amazed at the research that has been done at the University level. I hate to reinvent the wheel.
Best Regards,
Stephen Greenfield
Screenplay Systems, Inc.
www.screenplay.com / www.dramatica.com
-----Original Message-----
From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 21, 2002 2:53 PM
To: Stephen Greenfield
Cc: [EMAIL PROTECTED]
Subject: Re: Table-based / inverted index text search algorithms for Unicode / UTF-8?
Stephen,
I'm afraid, I don't know any publications on UTF-8 Boyer-Moore string
search, but I also haven't done a proper literature search on this yet.
A couple of regular expression libaries have been published for UTF-8,
but I haven't looked at these yet, and I don't know, how much effort has
already flown into optimizing these.
Is a byte-wise Boyer-Moore on UTF-8 data really that bad? I understand
that with a high fraction of non-ASCII characters, some initial byte
values would be very common, leading to short jumps only, but the
continuation bytes should have an entropy comparable to that of ASCII
characters and should cause the search to progress equally fast, at
least as long as you don't insist on case-insensitive matching, which is
probabaly best done after UTF-8 decoding, defeating much of the speed of
Boyer-Moore and similar string search algorithms.
Might be an interesting research challenge ...
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
