On Mon, Mar 02, 2009 at 03:53:40PM +0100, Werner LEMBERG wrote:
> Colin Watson wrote:
> > Obviously a class that consisted of more than just a single
> > character wouldn't have a Unicode codepoint or a glyph number or
> > anything, and \[CJKprepunct] wouldn't produce any output, but
> > '.cflags 2 \[CJKprepunct]' or whatever would be a sensible thing to
> > write.
>
> We could introduce a naming convention for character classes, say, to
> start such names with a dot, having the word `class' in its name, or
> something similar. Since the list of groff entities is not
> extensible, we have a broad range of possibilities. We could even use
> names similar to POSIX character ranges, e.g.,
>
>   .char \C'[:digit:]' 0123456789
>   abc\C'[:digit:]'abc

I like the hint of "class" in \C even though it obviously isn't the only
use nor the original intent. POSIX character ranges are an interesting
comparison here, although I always found the syntax a little cumbersome
to type; lacking other kinds of ranges, I think we could drop the
colons.

> > A simple initial implementation would essentially just change the
> > accessor methods of 'class charinfo' to look through all registered
> > character classes for ones that include the current character
> > (intentionally vague here as I haven't yet worked out how to deal
> > with ranges of Unicode codepoints that haven't been given entity
> > indices).
>
> This should probably support fall-back classes too, similar to the
> current mechanism for ordinary entities.

I've been trying to figure out a class analogue for this, and not
getting very far. It can't be a straight equivalent to fallback
characters, since we would need to decide on the appropriate class to
use after selecting the output glyph for a character using the existing
symbol lookup mechanism, rather than going through the same kind of
mechanism again.

I was instead envisaging a system in which a character can be in
multiple classes. On the input side, the character's flags are
determined by the bitwise-or of the flags applied to the character
itself and to all of its classes, and other properties of the character
are determined by the smallest class containing the character (with some
arbitrary resolution such as last-definition-wins in case of conflict).
On the output side, properties are determined by the smallest containing
class, with the same conflict resolution.

I suppose that, on the output side, we need to consider classes defined
in different font files. Is that what you were referring to? We would
normally only use classes defined in the current font, but would also
need to check classes defined in special fonts installed with .fspecial
or .special.
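
The input-side resolution rule described above (flags OR'd across the
character and all of its classes, other properties taken from the
smallest containing class, last definition winning ties) can be
sketched as follows. This is only an illustration in Python, not groff
code: CharClass, resolve_flags, resolve_property and the "label"
property are all hypothetical names invented for the example, and a
real implementation would live in the accessors of 'class charinfo'.

```python
from dataclasses import dataclass, field


@dataclass
class CharClass:
    """Hypothetical character class: a set of members plus flags/properties."""
    name: str
    members: frozenset
    flags: int = 0
    props: dict = field(default_factory=dict)


def containing(ch, classes):
    """Return the classes that include ch, preserving definition order."""
    return [c for c in classes if ch in c.members]


def resolve_flags(ch, own_flags, classes):
    """Bitwise-or of the character's own flags and those of all its classes."""
    flags = own_flags
    for c in containing(ch, classes):
        flags |= c.flags
    return flags


def resolve_property(ch, key, classes, default=None):
    """Take the property from the smallest containing class that defines it.

    Using <= means that, among equally small classes, the one defined
    last wins (the arbitrary tie-break assumed here).
    """
    best = None
    for c in containing(ch, classes):
        if key not in c.props:
            continue
        if best is None or len(c.members) <= len(best.members):
            best = c
    return best.props[key] if best is not None else default


# Illustrative classes only; the flag values are arbitrary bits.
hexdigit = CharClass("hexdigit", frozenset("0123456789abcdef"),
                     flags=4, props={"label": "hex"})
digit = CharClass("digit", frozenset("0123456789"),
                  flags=2, props={"label": "digit"})
classes = [hexdigit, digit]

flags_for_7 = resolve_flags("7", 1, classes)          # 1 | 4 | 2 == 7
label_for_7 = resolve_property("7", "label", classes)  # "digit" (smaller class)
label_for_a = resolve_property("a", "label", classes)  # "hex"
```

Note that under this rule a class can only ever add flag bits to its
members; nothing in resolve_flags lets a smaller class clear a bit set
by a larger one.
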

Aside from this I confess that I'm having trouble seeing the usefulness
or semantics of fallback classes. If my thoughts above don't cover it,
would you mind elaborating?

The one obvious flaw I see in the above is that East Asian line-breaking
algorithms work the other way round from Western ones: lines may break
anywhere unless prohibited. Representing this efficiently using the
current set of available cflags would mean a class for the whole CJK
range with .cflags 70, and then a class for no-break-before characters
with .cflags 68 and one for no-break-after characters with .cflags 66.
With a bitwise-or, though, the smaller classes could never clear a flag
set on the enclosing class (70 OR 68 is still 70). This may suggest that
it's better to use the smallest enclosing class for flags as well as for
other properties.

Thanks,

--
Colin Watson [[email protected]]
