[bug #58930] take baby steps toward Unicode

2020-08-15 Thread Dave
Follow-up Comment #7, bug #58930 (project groff):

[comment #4 comment #4:]
>  just lamenting the total disjunctivity of the set.

That two of the three, intended to serve different purposes, are disjunct
seems more laudable than lamentable.  But I'm not here to police your
feelings.

> I can't think of a more appropriate mapping for it.

Well, if there were a more appropriate mapping for \[u00A0], that mapping
should also apply to the Latin-1 A0.  They're the same character, just with
different input representations.

Speaking more generally, for a Latin-1 input file, "groff latin1.txt" and
"groff -Klatin1 latin1.txt" should produce identical output.  Presently, for
this character they do not.
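
A minimal sketch of how that claim could be checked, assuming groff (with its preconv preprocessor) is installed.  The file name latin1.txt follows the example above; the helper names make_latin1_sample and compare_k are hypothetical.  Byte A0 is written as the octal escape \240 because \x escapes are not portable across printf implementations, and the %%CreationDate filter is a precaution in case the PostScript driver stamps its output with a creation time.

```shell
#!/bin/sh
# Write a one-line Latin-1 sample containing a raw NO-BREAK SPACE (byte 0xA0).
make_latin1_sample() {
    printf 'consectetur\240adipiscing\n' > "$1"
}

# Typeset the file with and without -Klatin1 and report whether the
# results match.  Timestamp comments are filtered out so that only
# genuine typesetting differences count.
compare_k() {
    plain=$(groff "$1" | grep -v '^%%CreationDate')
    conv=$(groff -Klatin1 "$1" | grep -v '^%%CreationDate')
    if [ "$plain" = "$conv" ]; then
        echo identical
    else
        echo different
    fi
}

make_latin1_sample latin1.txt
# Only attempt the comparison where groff is actually available.
if command -v groff >/dev/null 2>&1; then
    compare_k latin1.txt
fi
```

If the two invocations treated A0 consistently, this would print "identical".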

> Might as well sweep that one into this report, then.

As long as it doesn't change the billing, I won't complain about you doing
more work than I asked for.

> tmac/pdf.tmac sources tmac/ps.tmac so the fix only has to be made in one
> place.

I should have said "notably but not limited to -Tps and -Tpdf."  Fixing this
in the device-specific tmac file then requires duplicating that fix for at
least -Tascii, -Tlatin1, and the various -TX* devices, and I couldn't even
begin to guess about the more obscure legacy devices.

On the one hand, I get that \[u2011] is a character, and characters are mapped
to glyphs, and glyphs reside in fonts, and fonts are device-specific, so some
device-specific code seems a reasonable place to handle it.

But zooming out, the semantics of U+2011 NON-BREAKING HYPHEN are not
device-specific; as an output glyph, it is always identical (as you note) to
\[hy], or \[u2010].  What distinguishes it from them is its behavior--and this
should be the same across all devices, suggesting it should be handled in a
device-independent section of the code.

I mean, I don't want to back-seat drive, and tell you your very simple
solution, which covers most output formats most people care about, isn't good
enough--except I guess I do, because that's kind of what I'm doing.

___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #58958] undocumented (or just broken) inability of .char to map to \:

2020-08-15 Thread Dave
Follow-up Comment #2, bug #58958 (project groff):

[comment #1 comment #1:]
> Alas, you didn't fix my typo.

Copy/paste failed to fix it.  I failed to notice it altogether.





[bug #58930] take baby steps toward Unicode

2020-08-15 Thread G. Branden Robinson
Follow-up Comment #6, bug #58930 (project groff):


[comment #5 comment #5:]
> On further investigation, it appears in fact to be 0% accurate.  See bug
> #58962.

groff_char(7) is _full_ of problems with accuracy.

It's on my (s)hit list.  I recently fixed up the introductory material but it
needs a lot more work.





[bug #58930] take baby steps toward Unicode

2020-08-15 Thread Dave
Follow-up Comment #5, bug #58930 (project groff):

[comment #2 comment #2:]
> groff_char(7) (which I only now thought to check) says it
> maps to \~.  But that appears to be less than 100% accurate:

On further investigation, it appears in fact to be 0% accurate.  See bug
#58962.





[bug #58962] Latin-1 NO-BREAK SPACE does not behave as documented

2020-08-15 Thread Dave
URL:
  

 Summary: Latin-1 NO-BREAK SPACE does not behave as documented
 Project: GNU troff
Submitted by: barx
Submitted on: Sat 15 Aug 2020 12:25:30 PM CDT
Category: None
Severity: 3 - Normal
  Item Group: Incorrect behaviour
  Status: None
 Privacy: Public
 Assigned to: None
 Open/Closed: Open
 Discussion Lock: Any
 Planned Release: None

___

Details:

(Another bug report spawned from the discovery process of bug #58930.)

Quoth groff_char(7): "the ISO latin1 _no-break space_ is mapped to `\~', the
stretchable space character."

An eminently sensible mapping.  Oh, if only it were so.

In fact, the Latin-1 no-break space (character 160 decimal, A0 hex):

* behaves the same as "\ ", the nonstretchable nonbreaking space character
* matches neither "\ " nor "\~" in an output-equivalency conditional

Examining these in detail:

=== Behavior ===

Consider an input file with one instance of the string "<>", representing a
nonbreaking space.  sed can convert this string to each type of nonbreaking
space under consideration (the two escapes and the raw Latin-1 character), and
the typeset results can then be compared by checking which ones produce
identical PostScript output.


$ cat t0
Lorem ipsum dolor sit amet, consectetur<>adipiscing elit, sed
do eiusmod tempor incididunt ut labore et dolore magna aliqua.
$ # Baseline test, for escapes expected to be different:
$ diff <(sed 's/<>/\\ /' t0 | groff) <(sed 's/<>/\\~/' t0 | groff) | wc
  8  68 403
$ # Output expected to be the same based on what the docs say:
$ diff <(sed 's/<>/\\~/' t0 | groff) <(sed 's/<>/\xA0/' t0 | groff) | wc
  8  68 403
$ # Output that turns out to be the same:
$ diff <(sed 's/<>/\\ /' t0 | groff) <(sed 's/<>/\xA0/' t0 | groff) | wc
  0   0   0
$ 


I'm filing this as "Incorrect behavio[u]r" rather than "Documentation" because
I believe the documented behavior is more sensible than the actual behavior. 
But that's a judgment call and open to debate.
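
The pairwise comparisons above could be wrapped in a small script along these lines.  This is only a sketch: the helper names variant and same are hypothetical, t0 is recreated with the same two lines shown in the transcript, byte A0 is written in octal for portability, and the %%CreationDate filter is a precaution against timestamp-only differences in the PostScript output.

```shell
#!/bin/sh
# Recreate the sample input; "<>" marks where the nonbreaking space goes.
cat > t0 <<'EOF'
Lorem ipsum dolor sit amet, consectetur<>adipiscing elit, sed
do eiusmod tempor incididunt ut labore et dolore magna aliqua.
EOF

# Replace the "<>" placeholder in file $1 with the sed replacement text $2.
# LC_ALL=C keeps sed from objecting to the lone 0xA0 byte.
variant() {
    LC_ALL=C sed "s/<>/$2/" "$1"
}

# Report whether two variants of t0 typeset to the same PostScript.
same() {
    a=$(variant t0 "$1" | groff | grep -v '^%%CreationDate')
    b=$(variant t0 "$2" | groff | grep -v '^%%CreationDate')
    if [ "$a" = "$b" ]; then echo same; else echo differ; fi
}

nbsp=$(printf '\240')    # raw Latin-1 NO-BREAK SPACE

# Only run the comparisons where groff is actually available.
if command -v groff >/dev/null 2>&1; then
    printf '"\\ " vs "\\~": ';  same '\\ ' '\\~'
    printf '"\\~" vs A0:   ';   same '\\~' "$nbsp"
    printf '"\\ " vs A0:   ';   same '\\ ' "$nbsp"
fi
```

Each argument to same is a sed replacement string, so a literal backslash has to be doubled, just as in the sed commands in the transcript.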

=== Equivalency conditional ===

Either way, if Latin-1 A0 behaves the same as one of "\ " or "\~", the
output-equivalency conditional operator (rendered as 'XXX'YYY' in the info
manual, though a host of characters besides single quotes can be used) ought
to recognize this.  But this operator claims the output of character A0 is
equivalent to neither one (first observed in comment #2 of the aforementioned
bug).


$ printf ".if '\xA0'\~' .tm equal\n" | groff
$ printf ".if '\xA0'\ ' .tm equal\n" | groff
$ 


(Granted, the documentation muddies what this operator is actually testing. 
The info manual is clear about 'XXX'YYY', saying this is "True if the output
produced by XXX is equal to the output produced by YYY."  But groff(7) is less
clear, saying that the test 's1's2' is "True if string s1 is identical to
string s2," which implies it's comparing _input_ strings.  Were that the case,
you'd expect both the above tests to be false... but you'd also expect
'\[em]'\[u2014]' to be false, which it isn't.)
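
A variant of the test that reports both outcomes explicitly, rather than staying silent when the comparison is false, might look like the following sketch.  The helper name emit_test is hypothetical; it pairs .ie with .el so that either "equal" or "differ" is sent to standard error via .tm.  Byte A0 is written as the octal escape \240, since \x escapes are not understood by every printf.

```shell
#!/bin/sh
# Print a roff fragment that reports, via .tm on standard error, whether
# groff considers the output of $1 and $2 equivalent.
emit_test() {
    printf ".ie '%s'%s' .tm equal\n.el .tm differ\n" "$1" "$2"
}

nbsp=$(printf '\240')    # raw Latin-1 NO-BREAK SPACE

# Only run the tests where groff is actually available.
if command -v groff >/dev/null 2>&1; then
    emit_test "$nbsp" '\~'       | groff > /dev/null
    emit_test "$nbsp" '\ '       | groff > /dev/null
    emit_test '\[em]' '\[u2014]' | groff > /dev/null
fi
```

Going by the observations above, the first two tests would be expected to report "differ" and the third "equal".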








[bug #58933] [PATCH] doc/groff.texi, man/groff.7.man: clarify description of \

2020-08-15 Thread Peter Schaffter
Follow-up Comment #3, bug #58933 (project groff):


> It's Peter's term; I'm just a fanboy.  I agree that "zero-width" seems
> redundant, but maybe he can think of something additional it communicates.

It's the definition used by Ossanna and Kernighan in CSTR #54, not mine,
though they switch it around.  I'm inclined to think "zero-width" is not
redundant, since a non-printing character could have a width.  "Zero-width,
non-printing" clarifies this.  I'd rather risk overstating than leave room
for doubt.  I suspect that was Ossanna's reasoning, too.
