UTF-8?

Stephen Greenfield Sun, 21 Apr 2002 15:10:05 -0700

Title: RE: Table-based / inverted index text search algorithms for Unicode / UTF-8?

as long as you don't insist on case-insensitive matching, which is probabaly best done after UTF-8 decoding, defeating much of the speed of Boyer-Moore and similar string search algorithms.

Exactly the problem. You're correct, that Boyer-Moore would be quite acceptably fast were it not for case-insensitive matching. Converting to Unicode then applying Boyer-Moore is still substantially faster brute-force string matching.

I can't shake the suspicion that if I organized the data by indexing, I could gain a dramatic increase in speed. That way, the cost of decoding the UTF-8 would be born by the building of the index, not the search itself.

As a non-academic, I'm always amazed at the research that has been done at the University level. I hate to reinvent the wheel.

Best Regards,

Stephen Greenfield

Screenplay Systems, Inc.

www.screenplay.com / www.dramatica.com

-----Original Message-----
From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 21, 2002 2:53 PM
To: Stephen Greenfield
Cc: [EMAIL PROTECTED]
Subject: Re: Table-based / inverted index text search algorithms for Unicode / UTF-8?

Stephen,

I'm afraid, I don't know any publications on UTF-8 Boyer-Moore string

search, but I also haven't done a proper literature search on this yet.

A couple of regular expression libaries have been published for UTF-8,

but I haven't looked at these yet, and I don't know, how much effort has

already flown into optimizing these.

Is a byte-wise Boyer-Moore on UTF-8 data really that bad? I understand

that with a high fraction of non-ASCII characters, some initial byte

values would be very common, leading to short jumps only, but the

continuation bytes should have an entropy comparable to that of ASCII

characters and should cause the search to progress equally fast, at

least as long as you don't insist on case-insensitive matching, which is

probabaly best done after UTF-8 decoding, defeating much of the speed of

Boyer-Moore and similar string search algorithms.

Might be an interesting research challenge ...

Markus

Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK

Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

RE: Table-based / inverted index text search algorithms for Unicode / UTF-8?

Reply via email to