[groff] 17/21: doc/*,man/*: Relocate invalid input material.

G. Branden Robinson Mon, 02 Jun 2025 02:53:59 -0700

gbranden pushed a commit to branch master
in repository groff.

commit a549e33e9e75756be2ee922e8c4d807b41066765
Author: G. Branden Robinson <g.branden.robin...@gmail.com>
AuthorDate: Mon Jun 2 03:11:19 2025 -0500


    doc/*,man/*: Relocate invalid input material.
    
    It's long nagged me that the discussion of invalid input characters was
    stuffed into the presentation of "identifiers".  Move it to the "Input
    Format" node of our Texinfo manual and groff(7).
    
    Also modestly recast for clarity and to favor active voice over passive.
---
 doc/groff.texi.in | 158 ++++++++++++++++++++++++++++++++++++------------------
 man/groff.7.man   | 123 +++++++++++++++++++++++++-----------------
 2 files changed, 181 insertions(+), 100 deletions(-)

diff --git a/doc/groff.texi.in b/doc/groff.texi.in
index bee66c930..9ffa82048 100644
--- a/doc/groff.texi.in
+++ b/doc/groff.texi.in
@@ -464,7 +464,7 @@ Documentation License''.
 @title groff
 @subtitle The GNU implementation of @code{troff}
 @subtitle version @VERSION@
-@subtitle May 2025
+@subtitle June 2025
 @author Trent@tie{}A.@: Fisher
 @author Werner Lemberg
 @author G.@tie{}Branden Robinson
@@ -5156,6 +5156,7 @@ inter-sentence space.
 * Tabs and Leaders::
 * Requests and Macros::
 * Macro Packages::
+* Input Format::
 * Input Encodings::
 * Input Conventions::
 @end menu
@@ -5655,7 +5656,7 @@ minimize forward references.
 
 @c ---------------------------------------------------------------------
 
-@node Macro Packages, Input Encodings, Requests and Macros, Text
+@node Macro Packages, Input Format, Requests and Macros, Text
 @subsection Macro Packages
 @cindex macro package
 @cindex package, macro
@@ -5675,21 +5676,104 @@ package can load it with the @code{mso} (``macro source'') request.
 
 @c ---------------------------------------------------------------------
 
-@c TODO: Move a lot of this node to the "Invoking groff" chapter.  Some
-@c of the discussion is better placed in discussion of output drivers
-@c (e.g., what character encodings _they_ support for output and their
-@c responsibility for converting to them) as well.
+@c BEGIN Keep roughly parallel with groff(7) section "Input Format".
+@node Input Format, Input Encodings, Macro Packages, Text
+@subsection Input Format
+
+@cindex Unicode
+Organize input to
+GNU
+@command{troff} @c GNU
+into lines separated by the Unix newline character
+(@code{U+000A}),
+using the character encoding it recognizes:
+ISO@tie{}Latin-1 (8859-1).
+We recommend use of ISO@tie{}646:1991@tie{}IRV (US-ASCII)
+or (equivalently) the Basic Latin subset
+of ISO@tie{}10646 (Unicode);
+see
+@cite{groff_char@r{(7)}}.
+
+@cindex invalid input characters
+@cindex input characters, invalid
+@cindex characters, invalid input
+A subset of control characters
+(from the sets ``C0 Controls'' and ``C1 Controls''
+as Unicode describes them)
+are invalid as input characters.
+GNU
+@command{troff} @c GNU
+discards such characters instead of interpreting them.@footnote{It also
+emits a warning in category
+@samp{input}.
+@xref{Warnings}.}
+It processes
+a character sequence ``foo'',
+followed by an invalid
+character and then ``bar'',
+as ``foobar''.
+
+Invalid input characters comprise
+@code{0x00},
+@code{0x0B},
+@code{0x0D}--@code{0x1F},
+and
+@code{0x80}--@code{0x9F}.@footnote{Historically,
+control characters like
+@acronym{ASCII}
+@code{STX},
+@code{ETX},
+and
+@code{BEL}
+(@key{Control+B},
+@key{Control+C},
+and
+@key{Control+G},
+respectively)
+have been observed in
+@code{roff}
+documents,
+particularly in macro packages employing them as delimiters
+with the output comparison operator
+to try to avoid collisions
+with the content of arbitrary user-supplied parameters
+(@pxref{Operators in Conditionals}).
+We discourage this expedient;
+in
+GNU
+@command{troff} @c GNU
+it is unnecessary
+(outside of compatibility mode)
+because the program parses delimited arguments
+at a different input level than their surrounding context.
+@xref{Implementation Differences}.}
+GNU
+@command{troff} @c GNU
+uses some of these code points for internal purposes,
+making non-trivial the extension of the program
+to accept UTF-8
+or other encodings that use characters from these ranges.
+@c END Keep roughly parallel with groff(7) section "Input Format".
+
+@c ---------------------------------------------------------------------
 
 @c BEGIN Keep roughly parallel with groff_tmac(5) section "Input
 @c Encodings".
-@node Input Encodings, Input Conventions, Macro Packages, Text
+@node Input Encodings, Input Conventions, Input Format, Text
 @subsection Input Encodings
 
-The @command{groff} command's @option{-k} option calls the
-@command{preconv} preprocessor to perform input character encoding
-conversions.  Input to the GNU @code{troff} formatter itself, on the
-other hand, must be in a single-byte encoding compatible with @w{ISO
-646:1991 IRV} (US-@acronym{ASCII}).
+The
+@command{groff}
+command's
+@option{-k}
+option calls the
+@command{preconv}
+preprocessor
+to perform input character encoding conversions to satisfy
+GNU
+@command{troff}'s
+requirement of a single-byte encoding compatible with
+@w{ISO 646:1991 IRV} (US-@acronym{ASCII}).
 
 Localization influences automatic hyphenation
 in two distinct but related respects.
@@ -5754,11 +5838,13 @@ To use @w{KOI8-R}, an encoding for the Russian language, either place
 supply @samp{-m koi8-r} as a command-line argument to @code{groff}.  The
 localization file @file{ru.tmac} takes care of this automatically; see
 @ref{Manipulating Hyphenation}.@footnote{KOI8-R code points in the range
-@code{0x80}--@code{0x9F} are not valid input to GNU @command{troff}; see
-@ref{Identifiers}.  This should be no impediment to practical documents,
-as these KOI8-R code points do not encode letters, but box-drawing
-symbols and characters that are better obtained via special character
-escape sequences; see @cite{groff_char@r{(7)}}.}
+@code{0x80}--@code{0x9F} are not valid input to GNU @command{troff};
+recall @ref{Input Format}.
+This restriction should be no impediment to practical documents,
+as these KOI8-R code points do not encode letters,
+but box-drawing symbols and characters
+that are better obtained via special character escape sequences;
+see @cite{groff_char@r{(7)}}.}
 
 @item latin2
 @cindex encoding, input, @w{ISO Latin-2} (@w{8859-2})
@@ -6654,40 +6740,8 @@ newline,
 space,
 or invalid as GNU
 @command{troff}
-input.
-
-@c XXX: We might move this discussion earlier since it is applicable to
-@c troff input in general, and include a reference to the `trin`
-@c request.
-@cindex invalid input characters
-@cindex input characters, invalid
-@cindex characters, invalid input
-@cindex Unicode
-Invalid input characters are a subset of control characters (from the
-sets ``C0 Controls'' and ``C1 Controls'' as Unicode describes them).
-When GNU @code{troff} encounters one in an identifier, it produces a
-warning in category @samp{input} (@pxref{Warnings}).  They are removed
-during interpretation: an identifier @samp{foo}, followed by an invalid
-character and then @samp{bar}, is processed as @samp{foobar}.
-
-Invalid input characters are @code{0x00}, @code{0x0B},
-@code{0x0D}--@code{0x1F}, and
-@code{0x80}--@code{0x9F}.@footnote{Historically, control characters like
-ASCII STX, ETX, and BEL (@key{Control+B}, @key{Control+C}, and
-@key{Control+G}) have been observed in @code{roff} documents,
-particularly in macro packages employing them as delimiters with the
-output comparison operator to try to avoid collisions with the content
-of arbitrary user-supplied parameters (@pxref{Operators in
-Conditionals}).  We discourage this expedient; in GNU @code{troff} it is
-unnecessary (outside of compatibility mode) because delimited arguments
-are parsed at a different input level than the surrounding context.
-@xref{Implementation Differences}.}  Some of these code points are used
-by GNU @code{troff} internally, making it non-trivial to extend the
-program to accept UTF-8 or other encodings that use characters from
-these ranges.@footnote{Consider what happens when a C1 control
-@code{0x80}--@code{0x9F} is necessary as a continuation byte in a UTF-8
-sequence.}
-
+input;
+recall @ref{Input Format}.
 Thus, the identifiers @samp{br}, @samp{PP}, @samp{end-list},
 @samp{ref*normal-print}, @samp{|}, @samp{@@_}, and @samp{!"#$%'()*+,-./}
 are all valid.  Discretion should be exercised to prevent confusion.
@@ -17043,7 +17097,7 @@ or a front end with the
 option to enable unsafe mode.
 
 @code{trf} discards invalid input characters;
-recall @ref{Identifiers}.
+recall @ref{Input Format}.
 
 For @code{cf}, within a diversion, ``completely unprocessed'' means that
 each line of a file to be inserted is handled as if it were preceded by
diff --git a/man/groff.7.man b/man/groff.7.man
index 0141d0677..d935a71ac 100644
--- a/man/groff.7.man
+++ b/man/groff.7.man
@@ -341,11 +341,13 @@ or terminal output.
 .SH "Input format"
 .\" ====================================================================
 .
-Input to GNU
+.\" BEGIN Keep (roughly) parallel with groff.texi node "Input Format".
+Organize input to
+GNU
 .I troff \" GNU
-is organized into lines separated by the Unix newline character
+into lines separated by the Unix newline character
 (U+000A),
-and must be in the character encoding it recognizes:
+using the character encoding it recognizes:
 ISO\~Latin-1 (8859-1).
 .
 We recommend use of ISO\~646:1991\~IRV (US-ASCII)
@@ -354,13 +356,76 @@ of ISO\~10646 (Unicode);
 see
 .MR groff_char @MAN7EXT@ .
 .
-The
-.MR preconv @MAN1EXT@
-preprocessor transforms other encodings,
-including UTF-8,
-to satisfy
-.IR @g@troff 's
-requirements.
+.
+.P
+A subset of control characters
+(from the sets \[lq]C0 Controls\[rq] and \[lq]C1 Controls\[rq]
+as Unicode describes them)
+are invalid as input characters.
+.
+GNU
+.I troff \" GNU
+discards such characters instead of interpreting them.
+(It also emits a warning in category
+\[lq]input\[rq];
+see section \[lq]Warnings\[rq]
+of
+.MR groff @MAN1EXT@ .)
+.
+It processes
+a character sequence \[lq]foo\[rq],
+followed by an invalid
+character and then \[lq]bar\[rq],
+as \[lq]foobar\[rq].
+.
+.
+.P
+Invalid input characters comprise
+.BR 0x00 ,
+.BR 0x0B ,
+.BR 0x0D \[en] 0x1F ,
+and
+.BR 0x80 \[en] 0x9F .
+(Historically,
+control characters like
+ASCII
+STX,
+ETX,
+and
+BEL
+(Control+B,
+Control+C,
+and
+Control+G)
+respectively)
+have been observed in
+.I roff
+documents,
+particularly in macro packages employing them as delimiters
+with the output comparison operator
+to try to avoid collisions with the content
+of arbitrary user-supplied parameters
+(see subsection \[lq]Conditional expressions\[rq] below).
+.
+We discourage this expedient;
+in
+GNU
+.I troff \" GNU
+it is unnecessary
+(outside of compatibility mode)
+because the program parses delimited arguments
+at a different input level than their surrounding context.
+.
+See section \[lq]Miscellaneous\[rq] of
+.MR groff_diff @MAN7EXT@ .)
+.
+GNU
+.I troff \" GNU
+uses some of these code points for internal purposes,
+making non-trivial the extension of the program
+to accept UTF-8
+or other encodings that use characters from these ranges.
+.\" END Keep (roughly) parallel with groff.texi node "Input Format".
 .
 .
 .\" ====================================================================
@@ -1332,44 +1397,6 @@ or invalid as GNU
 input.
 .
 .
-.\" XXX: We might move this discussion earlier since it is applicable to
-.\" troff input in general, and include a reference to the `trin`
-.\" request.
-.P
-Invalid input characters are a subset of control characters
-(from the sets \[lq]C0 Controls\[rq] and \[lq]C1 Controls\[rq] as
-Unicode describes them).
-.
-When
-.I @g@troff
-encounters one in an identifier,
-it produces a warning in category
-.RB \%\[lq] input \[rq]
-(see section \[lq]Warnings\[rq] in
-.MR @g@troff @MAN1EXT@ ).
-.
-They are removed during interpretation:
-an identifier \[lq]foo\[rq],
-followed by an invalid
-character and then \[lq]bar\[rq],
-is processed as \[lq]foobar\[rq].
-.
-.
-.P
-Invalid input characters are
-.BR 0x00 ,
-.BR 0x0B ,
-.BR 0x0D \[en] 0x1F ,
-and
-.BR 0x80 \[en] 0x9F .
-.
-Some of these code points are used by
-.I @g@troff
-internally,
-making it non-trivial to extend the program to accept UTF-8 or other
-encodings that use characters from these ranges.
-.
-.
 .P
 An identifier with a closing bracket (\[lq]]\[rq]) in its name can't be
 accessed with bracket-form escape sequences that expect an identifier as

_______________________________________________
groff-commit mailing list
groff-commit@gnu.org
https://lists.gnu.org/mailman/listinfo/groff-commit
