Re: [Groff] Character class query

Colin Watson Mon, 02 Mar 2009 16:11:14 -0800

On Mon, Mar 02, 2009 at 03:53:40PM +0100, Werner LEMBERG wrote:
> Colin Watson wrote:
> > Obviously a class that consisted of more than just a single
> > character wouldn't have a Unicode codepoint or a glyph number or
> > anything, and \[CJKprepunct] wouldn't produce any output, but
> > '.cflags 2 \[CJKprepunct]' or whatever would be a sensible thing to
> > write.
> 
> We could introduce a naming convention for character classes, say, to
> start such names with a dot, having the word `class' in its name, or
> something similar.  Since the list of groff entities is not
> extensible, we have a broad range of possibilities.  We could even use
> names similar to POSIX character ranges, e.g.,
> 
>   .char \C'[:digit:]' 0123456789
>   abc\C'[:digit:]'abc


I like the hint of "class" in \C, even though that obviously isn't its
only use or its original intent. POSIX character ranges are an
interesting comparison here, although I always found the syntax a little
cumbersome to type; since we have no other kinds of ranges to
distinguish them from, I think we could drop the colons.

> > A simple initial implementation would essentially just change the
> > accessor methods of 'class charinfo' to look through all registered
> > character classes for ones that include the current character
> > (intentionally vague here as I haven't yet worked out how to deal
> > with ranges of Unicode codepoints that haven't been given entity
> > indices).
> 
> This should probably support fall-back classes too, similar to the
> current mechanism for ordinary entities.

I've been trying to figure out a class analogue for this, and not
getting very far. It can't be a straight equivalent to fallback
characters since we would need to decide on the appropriate class to use
after selecting the output glyph for a character using the existing
symbol lookup mechanism, rather than going through the same kind of
mechanism again.

I was instead envisaging a system in which a character can be in
multiple classes. On the input side, the character's flags are
determined by the bitwise-or of the flags applied to the character
itself and to all of its classes, and other properties of the character
are determined by the smallest class containing the character (with some
arbitrary resolution such as last-definition-wins in case of conflict).
On the output side, properties are determined the same way as those
other properties: by the smallest class containing the character.
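
As a rough sketch of the input-side rule just described (not groff code;
the CharClass type, the resolve function, the "kind" property and the
class name CJKall are all hypothetical, and "smallest" is assumed here
to mean "fewest members"):

```python
from dataclasses import dataclass, field


@dataclass
class CharClass:
    name: str
    members: frozenset          # code points belonging to the class
    flags: int = 0              # value as it would be given to .cflags
    props: dict = field(default_factory=dict)  # hypothetical "other properties"


def resolve(code_point, own_flags, classes):
    """Return (flags, props) for a character; classes are in definition order."""
    containing = [c for c in classes if code_point in c.members]

    # Flags: bitwise-or of the character's own flags and those of every
    # class that contains it.
    flags = own_flags
    for c in containing:
        flags |= c.flags

    # Other properties: taken from the smallest containing class; if
    # several classes are equally small, the last one defined wins.
    props = {}
    if containing:
        smallest = min(len(c.members) for c in containing)
        candidates = [c for c in containing if len(c.members) == smallest]
        props = dict(candidates[-1].props)
    return flags, props


cjk = CharClass("CJKall", frozenset(range(0x3000, 0x3100)),
                flags=4, props={"kind": "cjk"})
prepunct = CharClass("CJKprepunct", frozenset({0x300C, 0x300E}),
                     flags=2, props={"kind": "opening-punct"})

flags, props = resolve(0x300C, 0, [cjk, prepunct])
```

Here U+300C is in both classes, so its flags come from both while its
other properties come only from the two-member class.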

I suppose that, on the output side, we need to consider classes defined
in different font files. Is that what you were referring to? We would
normally only use classes defined in the current font, but would also
need to check classes defined in special fonts installed with .fspecial
or .special.
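
A minimal sketch of that output-side search, assuming classes are stored
per font and that the order mirrors the usual glyph search (current
font, then its .fspecial fonts, then .special fonts). The function name,
the data layout and the class names other than CJKprepunct are
hypothetical:

```python
def classes_for_output(current_font, font_classes, fspecial, special):
    """Collect the classes visible when printing in current_font.

    font_classes: font name -> list of class names defined in that font
    fspecial:     font name -> fonts registered for it with .fspecial
    special:      fonts registered with .special
    """
    order = [current_font] + fspecial.get(current_font, []) + special
    seen = set()
    visible = []
    for font in order:
        if font in seen:        # a font may be listed more than once
            continue
        seen.add(font)
        visible.extend(font_classes.get(font, []))
    return visible


font_classes = {"R": ["CJKall"], "SS": ["CJKpostpunct"], "S": ["CJKprepunct"]}
visible = classes_for_output("R", font_classes, {"R": ["SS"]}, ["S"])
```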

Aside from this I confess that I'm having trouble seeing the usefulness
or semantics of fallback classes. If my thoughts above don't cover it,
would you mind elaborating?


The one obvious flaw I see in the above is that East Asian line-breaking
algorithms work the other way round from Western ones: lines may break
anywhere unless prohibited. Representing this efficiently using the
current set of available cflags would mean a class for the whole CJK
range with .cflags 70, and then a class for no-break-before characters
with .cflags 68 and one for no-break-after characters with .cflags 66.
This may suggest that it's better to use the smallest enclosing class
for flags as well as for other properties.
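
The arithmetic behind this can be checked with a small sketch. The
values 70, 68 and 66 are the ones above; the constant and function names
are hypothetical, the reading of 2 and 4 as "break before" and "break
after" follows the cflags documentation, and 64 is simply the remaining
bit common to all three values:

```python
BREAK_BEFORE = 2
BREAK_AFTER = 4
COMMON_BIT = 64                 # present in 70, 68 and 66 alike

CJK_ALL = 70                    # 64 + 4 + 2: break allowed on both sides
NO_BREAK_BEFORE = 68            # 64 + 4: break allowed only after
NO_BREAK_AFTER = 66             # 64 + 2: break allowed only before


def flags_by_or(class_flags):
    """Combine the flags of all containing classes with bitwise-or."""
    result = 0
    for f in class_flags:
        result |= f
    return result


def flags_by_smallest(sized_classes):
    """Take the flags of the smallest class; items are (size, flags) pairs."""
    return min(sized_classes, key=lambda item: item[0])[1]


# A character in both the whole-CJK class and the no-break-after class.
ored = flags_by_or([CJK_ALL, NO_BREAK_AFTER])
smallest = flags_by_smallest([(20000, CJK_ALL), (30, NO_BREAK_AFTER)])

can_break_after_with_or = bool(ored & BREAK_AFTER)
can_break_after_with_smallest = bool(smallest & BREAK_AFTER)
```

With bitwise-or the broad class's break-after bit survives (70 | 66 is
still 70), so the narrower class cannot take it away; with the
smallest-class rule the character ends up with 66 and the prohibition
holds. The class sizes 20000 and 30 are made-up figures.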

Thanks,

-- 
Colin Watson                                       [[email protected]]
