Re: looking for a case-insensitive string search algorithm

Ken Krugler Thu, 16 Sep 2004 15:56:52 -0700

Hi Stephen,

Does anyone know of a good way to perform a case-insensitive search of a large text string for the first occurrence of a small text string - using code that supports non-ASCII character encodings (possibly multiple bytes per character) - and which doesn't restrict matches to beginning at the beginning of a word? (I.e. something just like TxtFindString() except allowing matches that start one or more characters past the beginning of a word) Alternatively, even a case-insensitive version of StrChr() would be helpful... Thanks in advance, Stephen

p.s. no need to point me to approaches that assume ASCII characters, since I understand case-insensitivity in that context - it's when we get into alternate character encodings that I don't have a good grasp of what 'case insensitivity' (or accent insensitivity) might mean(!).

Normally the default behavior for case-insensitive ("weakly equal" is a better term) string matching is to treat two characters as being equal if their primary sort value is the same.
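To make the idea concrete, here is a minimal sketch of a "weakly equal" substring search, written in Python rather than against the Palm OS API. It assumes that a primary sort value can be approximated by decomposing each character (NFKD), dropping combining marks, and case-folding the result. That is only a rough stand-in for real collation data, and the names primary_key, fold_with_offsets and weak_find are hypothetical helpers invented for this illustration.

```python
import unicodedata


def primary_key(ch):
    """Approximate the primary sort value of one character.

    Assumption: NFKD decomposition + dropping combining marks + case folding
    is a rough stand-in for a real collation table, not a replacement for one.
    """
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def fold_with_offsets(text):
    """Return the folded key string plus, for each key character,
    the index of the original character it came from."""
    keys = []
    offsets = []
    for i, ch in enumerate(text):
        for k in primary_key(ch):
            keys.append(k)
            offsets.append(i)
    return "".join(keys), offsets


def weak_find(haystack, needle):
    """Index in haystack of the first weakly-equal match, or -1 if none.

    Matches may start anywhere, not only at the beginning of a word.
    """
    hay_keys, hay_offsets = fold_with_offsets(haystack)
    needle_keys, _ = fold_with_offsets(needle)
    if not needle_keys:
        return 0
    pos = hay_keys.find(needle_keys)
    return hay_offsets[pos] if pos >= 0 else -1
```

With this sketch, weak_find("Grand Caf\u00e9 Noir", "CAFE") finds the accented word, and a search for "RASCH" matches in the middle of "\u00dcberraschung". The offsets list is there because folding can change the length of the text, so positions in the folded string have to be mapped back to positions in the original.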

Unfortunately there's no easy way for a Palm application to implement this, for various reasons. For example, in Japanese you can get multiple characters (not just multiple bytes) combining to form a single "sort value" that is then used for comparison purposes. So even if you wrote code to call TxtCaselessCompare with every valid code point, and used the resulting ordering, it wouldn't be correct.

If this is critical, then I'd look at the ICU open source code/data as a starting point, as they've implemented this kind of search support. But you'd have to convert it to work with the device encoding, since otherwise converting all of the data to Unicode during a search would make things very slow (unless you can store the data being searched as UTF-16).

At 12:00am -0700 9/16/04, Veronica Loell wrote:

I can't really see how the meaning of 'case insensitivity' can be
dependent on the character encoding. No matter how an alphabet is
encoded, the matching characters will be the same. What does make a
difference here is the locale/language, unless you assume that is
decided by the encoding used?

As you move beyond ASCII, the meaning of "weakly equal" expands significantly. For example, in Japanese you can have a base Hiragana character such as "ha" followed by a prolonged sound (chou-on) mark. This changes the sound of the vowel (makes it long), but the resulting two characters combine to form a single unit that is weakly equal to just the base "ha" character, as well as the Katakana "ha" character, and the single-byte (half-width) "ha" character, etc.
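The "ha" example can be modelled with a toy comparison, again in Python and again only a sketch. It assumes three simplifications that stand in for real collation data: NFKC normalization turns half-width kana into full-width kana, Katakana maps to Hiragana by a fixed code point offset of 0x60, and the chou-on mark is dropped entirely at the primary level. The names kana_primary_keys and weakly_equal are hypothetical.

```python
import unicodedata

CHOU_ON = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK


def kana_primary_keys(text):
    """Toy primary keys for kana text (illustrative assumptions only).

    - NFKC folds half-width kana (and the half-width chou-on) to full width.
    - The chou-on mark is dropped, so "ha" + chou-on keys the same as "ha".
    - Katakana U+30A1..U+30F6 is shifted down by 0x60 onto Hiragana.
    """
    keys = []
    for ch in unicodedata.normalize("NFKC", text):
        if ch == CHOU_ON:
            continue
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            ch = chr(code - 0x60)
        keys.append(ch)
    return keys


def weakly_equal(a, b):
    """True if both strings produce the same sequence of toy primary keys."""
    return kana_primary_keys(a) == kana_primary_keys(b)


HIRAGANA_HA = "\u306f"
KATAKANA_HA = "\u30cf"
HALFWIDTH_HA = "\uff8a"

print(weakly_equal(HIRAGANA_HA + CHOU_ON, HIRAGANA_HA))  # ha + chou-on vs ha
print(weakly_equal(HIRAGANA_HA, KATAKANA_HA))            # Hiragana vs Katakana
print(weakly_equal(HIRAGANA_HA, HALFWIDTH_HA))           # full vs half width
```

Note that the two-character sequence "ha" plus chou-on yields a single key here, which is why comparing one code point at a time cannot reproduce this behavior.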

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

