I'm not sure whether these belong in the charset vtable or should be separate. Putting them in the vtable is probably the right (or at least least-wrong) thing to do.
It's probably pretty clear, but there's no language or locale stuff in here. There ought to be, but I'm not sure where to put it, so there isn't. Language and locale *do* affect the classification of characters, depending on how you look at it, so there's a possibility we will abstract this out somewhat. We likely need to do that for the regex engine(s) anyway, so that can wait for later.
In the following, XXXX is one of: wordchar, whitespace, digit, punctuation, and newline.
INTVAL is_XXXX(STRING, glyph_offset)
Return 1 if the glyph at the offset is in the specified class, 0 otherwise.
INTVAL find_XXXX(STRING, glyph_offset)
Return the offset of the first glyph in the string at or after the offset which is in the class. -1 means there isn't one.
INTVAL find_not_XXXX(STRING, glyph_offset)
Return the offset of the first glyph at or after the offset which is *not* in the specified class. -1 means there isn't one.
INTVAL find_word_boundary(STRING, glyph_offset)
Return the offset of the first glyph which is after a word boundary, searching from the given offset.
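To make the shape of these entries concrete, here's a rough sketch of the wordchar flavor for a plain ASCII charset. Nearly everything in it is an assumption for illustration: the one-byte-per-glyph STRING struct, the CHARSET_CLASS_VTABLE name, passing the string by const pointer, treating [A-Za-z0-9_] as word characters, and the reading of find_word_boundary as "first glyph at or after the offset whose wordchar-ness differs from the glyph before it". It is not the real Parrot STRING or vtable layout.

```c
#include <ctype.h>
#include <string.h>

typedef long INTVAL;                 /* stand-in for the real INTVAL */

/* Hypothetical simplified string: one byte per glyph, ASCII only. */
typedef struct {
    const char *buf;
    INTVAL      len;                 /* length in glyphs */
} STRING;

/* Hypothetical slice of a charset vtable, wordchar entries only. */
typedef struct {
    INTVAL (*is_wordchar)(const STRING *, INTVAL);
    INTVAL (*find_wordchar)(const STRING *, INTVAL);
    INTVAL (*find_not_wordchar)(const STRING *, INTVAL);
    INTVAL (*find_word_boundary)(const STRING *, INTVAL);
} CHARSET_CLASS_VTABLE;

/* 1 if the glyph at off is [A-Za-z0-9_], else 0 (also 0 when out of range). */
static INTVAL ascii_is_wordchar(const STRING *s, INTVAL off) {
    if (off < 0 || off >= s->len) return 0;
    unsigned char c = (unsigned char)s->buf[off];
    return (c < 128 && (isalnum(c) || c == '_')) ? 1 : 0;
}

/* First offset >= off whose wordchar-ness equals want, or -1 if none. */
static INTVAL scan_wordchar(const STRING *s, INTVAL off, INTVAL want) {
    for (INTVAL i = off < 0 ? 0 : off; i < s->len; i++)
        if (ascii_is_wordchar(s, i) == want) return i;
    return -1;
}

static INTVAL ascii_find_wordchar(const STRING *s, INTVAL off) {
    return scan_wordchar(s, off, 1);
}

static INTVAL ascii_find_not_wordchar(const STRING *s, INTVAL off) {
    return scan_wordchar(s, off, 0);
}

/* Assumed semantics: first glyph at or after off that differs in
 * wordchar-ness from the glyph before it; positions outside the
 * string count as non-word. Returns -1 if there is no such glyph. */
static INTVAL ascii_find_word_boundary(const STRING *s, INTVAL off) {
    for (INTVAL i = off < 0 ? 0 : off; i < s->len; i++)
        if (ascii_is_wordchar(s, i - 1) != ascii_is_wordchar(s, i)) return i;
    return -1;
}

static const CHARSET_CLASS_VTABLE ascii_classes = {
    ascii_is_wordchar,
    ascii_find_wordchar,
    ascii_find_not_wordchar,
    ascii_find_word_boundary
};

/* Convenience constructor for the hypothetical STRING above. */
STRING make_string(const char *cstr) {
    STRING s = { cstr, (INTVAL)strlen(cstr) };
    return s;
}
```

With the string "hi there", a caller going through ascii_classes would see find_not_wordchar from offset 0 land on the space, and find_wordchar from that space land on the "t". The other classes (whitespace, digit, punctuation, newline) would follow the same pattern with a different predicate.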
Yes, I can see having particular classes of characters, and classes which are explicitly specified, but I'm not sure those are truly and properly generic, so they're not here. I can certainly see adding in support for that if we think it's appropriate.
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk