Charset API, part 2

Dan Sugalski Fri, 13 Aug 2004 01:54:08 -0700

This is part two of the charset API. Part one dealt with access and transformation of strings; part two deals with glyph and codepoint classification.

I'm not sure if these belong in the charset vtable or should be separate. Probably putting them in the vtable's the right (or at least least-wrong) thing to do.

It's probably pretty clear, but there's no language or locale stuff in here. There ought to be, but I'm not sure where to put it, so there isn't. Language and locale *do* affect the classification of characters, depending on how you look at it, so there's a possibility we will abstract this out somewhat. We likely need to do that for the regex engine(s) anyway, so that can wait for later.

In the following, XXXX is one of: wordchar, whitespace, digit, punctuation, and newline.

  INTVAL is_XXXX(STRING, glyph_offset)

    Return 1 if the glyph at the offset is in the specified class, 0
    if it is not.

  INTVAL find_XXXX(STRING, glyph_offset)

    Return the offset of the first glyph in the string at or after the
    offset which is in the class. -1 means there isn't one.

  INTVAL find_not_XXXX(STRING, glyph_offset)

    Return the offset of the first glyph at or after the offset which
    is *not* in the specified class.

  INTVAL find_word_boundary(STRING, glyph_offset)

    Return the offset of the first glyph at or after the offset which
    comes immediately after a word boundary.
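
For illustration only, here is a rough sketch of how the wordchar flavor of these entries might look for a fixed-width, one-byte-per-glyph ASCII charset. The STRING struct, the INTVAL typedef, the ascii_wordchar helper, and the choice to return -1 from find_not_wordchar and find_word_boundary when nothing is found are all assumptions made for the example, not part of the proposal above. A real charset would plug its own classification tables in where isalnum() is used.

```c
#include <ctype.h>

typedef long INTVAL;

/* Hypothetical stand-in for the real STRING: one byte per glyph. */
typedef struct {
    const unsigned char *buf;
    INTVAL               len;   /* length in glyphs */
} STRING;

/* Assumed definition of "wordchar" for plain ASCII: letters, digits, '_'.
 * isalnum() is locale-dependent; in the default "C" locale it is ASCII only. */
static int ascii_wordchar(unsigned char c) {
    return isalnum(c) || c == '_';
}

/* 1 if the glyph at off is a word character, 0 if not or out of range. */
INTVAL is_wordchar(const STRING *s, INTVAL off) {
    if (off < 0 || off >= s->len)
        return 0;
    return ascii_wordchar(s->buf[off]) ? 1 : 0;
}

/* Offset of the first word character at or after off, -1 if none. */
INTVAL find_wordchar(const STRING *s, INTVAL off) {
    INTVAL i;
    for (i = off < 0 ? 0 : off; i < s->len; i++)
        if (is_wordchar(s, i))
            return i;
    return -1;
}

/* Offset of the first non-word character at or after off.
 * Returning -1 when there is none is an assumption. */
INTVAL find_not_wordchar(const STRING *s, INTVAL off) {
    INTVAL i;
    for (i = off < 0 ? 0 : off; i < s->len; i++)
        if (!is_wordchar(s, i))
            return i;
    return -1;
}

/* Offset of the first glyph at or after off that sits just after a
 * word/non-word transition. The start of the string counts as non-word,
 * so offset 0 qualifies when the string begins with a word character.
 * Returning -1 when there is none is an assumption. */
INTVAL find_word_boundary(const STRING *s, INTVAL off) {
    INTVAL i;
    for (i = off < 0 ? 0 : off; i < s->len; i++) {
        INTVAL before = (i > 0) ? is_wordchar(s, i - 1) : 0;
        if (before != is_wordchar(s, i))
            return i;
    }
    return -1;
}
```

With the ten-glyph string "  foo, bar", find_wordchar from offset 0 lands on the 'f' at offset 2, find_not_wordchar from offset 2 lands on the ',' at offset 5, and find_word_boundary from offset 3 also returns 5, since that is the first glyph past the end of "foo".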

Yes, I can see having particular classes of characters, and classes which are explicitly specified, but I'm not sure those are truly and properly generic, so they're not here. I can certainly see adding in support for that if we think it's appropriate.

-- Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
