[bug #58930] take baby steps toward Unicode

Dave Fri, 14 Aug 2020 20:04:33 -0700

Follow-up Comment #2, bug #58930 (project groff):

[comment #1:]
> 1. "U+00A0 NO-BREAK SPACE
> 
> None of these are equivalent to the others. :-/


"\~" and "\ " _shouldn't_ be equivalent; they're documented as behaving
differently.

That the input string "\[u00A0]" is equivalent to neither of these is exactly
the problem this part of the bug report is looking to solve.

It's only the character NO-BREAK SPACE in its Latin-1 form, which groff
accepts as direct input, that groff recognizes and interprets as a nonbreaking
space.  groff_char(7) (which I only now thought to check) says it maps to \~.
But that appears to be less than 100% accurate, since this comparison prints
nothing:


$ LC_CTYPE=en_US.iso88591 printf ".if '\u00A0'\~' .tm equal\n" | groff
$ 


But the upshot is, however groff interprets a Latin-1 A0, it really ought to
interpret the form of that character emitted by preconv, \[u00A0],
identically.
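
To make the two forms of the character concrete, here is a rough Python
imitation of the transformation at issue.  It is only a sketch: the function
name to_groff_escapes is made up for illustration, and it assumes nothing more
than that every code point above 127 becomes a \[uXXXX] escape, which is a
simplification of what preconv really does.

```python
def to_groff_escapes(text):
    """Replace every code point above 127 with a \\[uXXXX] escape.

    Hypothetical helper for illustration; not preconv's actual code.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(ch if cp < 128 else "\\[u%04X]" % cp)
    return "".join(out)


# A Latin-1 byte string containing a raw A0 (NO-BREAK SPACE).
latin1_bytes = b"non\xa0breaking"
print(to_groff_escapes(latin1_bytes.decode("latin-1")))
```

The raw A0 byte and the escape this prints denote the same character, which is
why groff ought to treat them the same way.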

> 2. The behavior of \: when used as the RHS of a .char request
> does indeed seem a bit strange.

Yeah, I really need to open a separate bug report for this, because it's
unrelated to everything else here.

> 3. Narrow no-break space.  Have you named all of the non-breaking
> spaces in Unicode in this ticket?

No.  I was intentionally trying to keep it simple and minimal.  But it turns
out there are only three:

http://en.wikipedia.org/wiki/Whitespace_character#Unicode

So the only one I didn't cover was U+2007 FIGURE SPACE, which should map to
groff's (already nonbreaking) \0.
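
The count of three can be cross-checked without Wikipedia.  The sketch below
uses Python's unicodedata module and assumes that "nonbreaking space" means a
character of general category Zs whose compatibility decomposition is tagged
<noBreak>; that criterion is my assumption, not something Unicode or groff
spells out in these terms.  The PROPOSED table holds only the two mappings
mentioned in this comment.

```python
import unicodedata


def no_break_spaces():
    """Return code points of category Zs with a <noBreak> decomposition."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) != "Zs":
            continue
        if unicodedata.decomposition(ch).startswith("<noBreak>"):
            found.append(cp)
    return found


# groff escapes named in this comment; anything else is left unstated.
PROPOSED = {0x00A0: r"\~", 0x2007: r"\0"}

for cp in no_break_spaces():
    name = unicodedata.name(chr(cp))
    print("U+%04X %s -> %s" % (cp, name, PROPOSED.get(cp, "(not stated here)")))
```

The same test applied to U+2009 THIN SPACE and U+200A HAIR SPACE finds a
<compat> tag instead, so they do not show up in this list.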

> there are bunch of others (hair space, thin space, ideographic space,
> ...) but I don't know what their breaking semantics are in Unicode.

Irrational, IMO.  Unicode considers U+2009 THIN SPACE and
U+200A HAIR SPACE breakable, for no good reason that I can see.  Groff (quite
sensibly, since the concept is sort of absurd) does not offer breaking
versions of these spaces, and the only reason to add them would be strict
compliance with a Unicode property that probably no one who uses those code
points actually wants: I can't think of a single real-world use case for a
breaking thin space (though perhaps this is merely a failure of my
imagination).

This is all another can of worms I intentionally didn't address in what I
intended to be a simple change.

> 4. A non-breaking hyphen would then be something that looks
> like \[hy] but doesn't actually break?

Yes.

> You can just use the character as-is in input.

Ah, I guess you used -Tutf8 output, where that does work.  (Somehow your groff
command got stripped from your comment.)  All other output formats (notably
-Tps and -Tpdf) produce "warning: can't find special character 'u2011'".

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58930>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
