Marcin 'Qrczak' Kowalczyk writes:

> > > isPrint    c = category is other than [Zl,Zp,Cc,Cf,Cs,Co]
> > 
> > I think Cf (Format Control) and Co (Private Use) should be counted as
> > printable.
> 
> Co - OK.
> Cf - really? I checked which characters are these and they don't look
> much like printable, more like control characters...

Non-printable characters are those that applications should make some
effort not to output to the screen. For example, GNU ls
replaces non-printable characters in filenames with question marks. If
you have a filename with Cf characters in it, IMO they should get
treated like the other characters in the filename, not filtered out.
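
To make that concrete, here is a minimal sketch in Python, chosen only
because its standard unicodedata module exposes the general category.
The names is_print and sanitize are made up for illustration, and the
set of excluded categories is just the proposed list with Cf and Co
moved to the printable side, not a finished definition.

```python
import unicodedata

# Categories still treated as non-printable once Cf and Co are allowed:
# line separator, paragraph separator, control, surrogate.
NON_PRINTABLE = {"Zl", "Zp", "Cc", "Cs"}

def is_print(c):
    """Return True if the single character c should be output as-is."""
    return unicodedata.category(c) not in NON_PRINTABLE

def sanitize(name):
    """Replace non-printable characters with '?', in the style of GNU ls."""
    return "".join(c if is_print(c) else "?" for c in name)
```

With this rule, a filename containing U+200E (category Cf) keeps that
character, while a BEL (category Cc) in the same name is replaced by a
question mark.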

> > > isSpace    c = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
> > 
> > From that, please exclude those characters of category Zs (Space)
> > which have "noBreak" mentioned in their UnicodeData line.
> 
> Hmm, glibc-2.1.3 says that iswspace(160).

That was wrong, and glibc-2.1.94 has it fixed.

> Perhaps you are right that splitting a line into words should not split
> on U+00A0. Line breaking on characters satisfying isSpace seems to be
> correct

Breaking up words is the primary purpose of isSpace, and in this
context U+00A0 should behave like a symbol. The Unicode tables
consider U+00A0 a space because they are focused on rendering, not
line breaking, and for rendering purposes U+00A0 is a blank glyph like
U+0020.
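
A rough sketch of the exclusion, again in Python and under the
assumption that the "noBreak" tag is read from the decomposition field
of UnicodeData (which is what unicodedata.decomposition returns). The
helper names is_space and split_words are hypothetical.

```python
import unicodedata

ASCII_SPACES = "\t\n\r\f\v"

def is_space(c):
    """Word-separating whitespace: Zs/Zl/Zp minus the no-break spaces."""
    if c in ASCII_SPACES:
        return True
    cat = unicodedata.category(c)
    if cat in ("Zl", "Zp"):
        return True
    # The decomposition field carries the <noBreak> tag,
    # e.g. "<noBreak> 0020" for U+00A0.
    no_break = unicodedata.decomposition(c).startswith("<noBreak>")
    return cat == "Zs" and not no_break

def split_words(text):
    """Split text into words at characters satisfying is_space."""
    words, current = [], []
    for c in text:
        if is_space(c):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(c)
    if current:
        words.append("".join(current))
    return words
```

Here split_words("10\u00a0km away") yields two words, the first of
which still contains the U+00A0, while an ordinary U+0020 or U+2003
splits as usual.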

> So I'm adding isSymbol (and this predicate = isSymbol ch || isPunct ch).

Sounds very reasonable. '@' and '~' are not punctuation, they are
symbols.

> > > isUpper    c = category is one of [Lu,Lt]
> > > isLower    c = category is Ll
> > 
> > The isUpper/isLower categorization should take the toUpper/toLower
> > mappings into account.
> 
> What do you mean?

Many people use code like

     if isUpper c
       c := toLower c

thinking that "if toLower c != c then isUpper c is true". If you just
look at the categories, you miss some characters.
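
One possible reading of "take the mappings into account", sketched in
Python: treat a character as upper case if its category says so or if
it has a lowercase mapping that differs from itself. The function
names are hypothetical, and str.lower() stands in for toLower here.

```python
import unicodedata

def is_upper_by_category(c):
    """The category-only definition: Lu or Lt."""
    return unicodedata.category(c) in ("Lu", "Lt")

def is_upper(c):
    """Category test, extended by the lowercase mapping."""
    return is_upper_by_category(c) or c.lower() != c
```

U+2160 (ROMAN NUMERAL ONE, category Nl) and U+24B6 (CIRCLED LATIN
CAPITAL LETTER A, category So) both have lowercase mappings, so the
category-only test misses them while the extended one does not.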

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
