Hi Eli,

At 2026-01-31T10:37:59+0200, Eli Zaretskii wrote:
> Would you mind describing the OSC 8 escape sequences that can be found
> in man pages,

They won't be found in man page _source_ documents. To date, they occur
only when grotty(1) formats a man page using the man(7) or mdoc(7) macro
languages, hyperlink-producing macros are used in those documents, and
grotty has not been directed to use "legacy output format".

The OSC 8 escape sequences that grotty produces look as follows. I'm
quoting grotty's C++ (of vintage dialect) source code.[1]

src/devices/grotty/tty.cpp:

    #define CSI "\033["
    #define OSC8 "\033]8"
    #define ST "\033\\"

    // Produce an OSC 8 hyperlink. Given ditroff input of the form:
    //   x X tty: link [URI[ KEY=VALUE] ...]
    // produce "OSC 8 [;KEY=VALUE];[URI] ST". KEY/VALUE pairs can be
    // repeated arbitrarily and are separated by colons. Omission of the
    // URI ends the hyperlink that was begun by specifying it. See
    // <https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda>.

(That URL is the same one that Per Bothner shared.)

The brackets indicate optional character sequences, and spaces are used
only for clarity to human readers. Do _not_ expect them in grotty's
output (in this context).

> and what each one of them means?

I'm happy to explain, but apart from Egmont's "gist" above, ECMA-48 is
the controlling authority for the structure of these escape sequences.

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf

If there were a semantic convention for "OSC 7" and/or "OSC 8", we'd
expect them to follow a format similar to the foregoing.

You can expect to see the following:

1. ESC
2. ]

    These bytes select an "operating system command (OSC)".

3. 8

    This byte selects the semantics the operating system command will
    use. Only convention governs these. I followed Egmont's proposed
    spec, linked above, as closely as I understood it.

4. ;

    The semicolon begins, and separates, variable-length data items. I
    would add subsequent non-semicolon characters to a queue.

5a. If you encounter another semicolon, the contents of the queue
    represent a key/value pair.
    Most implementations, to my knowledge, ignore and discard these. It
    is a mechanism for extension, and I know of no convention yet agreed
    upon for assigning any meanings to these key/value pairs by any two
    implementations. If you don't want to discard them, save them and
    populate your queue anew as in step 4.

5b.1. More likely, you will encounter a string terminator (ST) escape
    sequence (ESC \). Stop enqueueing characters and treat the contents
    of the queue, if not empty, as a URL. Conduct whatever validation
    on this URL you like and/or feel honor-bound to perform. What
    follows in grotty's output stream is not part of the OSC 8 sequence
    and constitutes "link text", the material that should support
    activation as a hyperlink.

5b.2. If the queue is empty, the link text is concluded. If there was
    no output between the commencement of link text and its
    termination, the meaning is undefined. I would discard the stored
    URL.

5c. _Optionally_, you can treat a BEL (C-g) as equivalent to a string
    terminator. This practice is outside the ECMA-48 specification,
    but such termination is sometimes produced by applications
    targeting "color xterms" of the 1990s written by people who lacked
    access to, or ignored, ECMA-48, and clunkily implemented SGR
    support. I would not support this practice for OSC 8; I know of
    nothing that produces such ill-terminated sequences for its much
    newer convention. I wouldn't even mention it, except that I fear
    that some terminal emulator developer who spends more effort on
    promotional activities than on ensuring code quality will bring it
    up.

Because the string terminator ends the escape sequence, the next bytes
you read will fall into one of the following exhaustive categories.

1. the start of an SGR escape sequence (starts with ESC [);
2. the start of an OSC 8 escape sequence--if you are already within
   link text, the occurrence and therefore nesting of these is
   undefined, and I would ignore them;
3. Unicode Basic Latin code points minus DEL, plus LF, TAB, and FF,
   encoded in single bytes;[2] or
4. a UTF-8 multibyte character sequence (only if GNU troff's output
   directed grotty to read the description of the "utf8" device).

> That would greatly simplify the task of teaching Emacs to DTRT with
> those sequences.

I hope this is helpful.

Here is a concrete example of a degenerate man page that exercises this
feature, formatted by grotty (using groff 1.24.0.rc2).

    $ printf 'Of course it runs\n.UR https://www.gnu.org/software/emacs\nEmacs\n.UE .\n.pl \\n[nl]u\n' | ~/groff-HEAD/bin/nroff -man | od -c
    0000000   O   f       c   o   u   r   s   e       i   t       r   u   n
    0000020   s     033   ]   8   ;   ;   h   t   t   p   s   :   /   /   w
    0000040   w   w   .   g   n   u   .   o   r   g   /   s   o   f   t   w
    0000060   a   r   e   /   e   m   a   c   s 033   \   E   m   a   c   s
    0000100 033   ]   8   ;   ; 033   \   .  \n
    0000111

You can see here that grotty always marks its empty set of key/value
pairs explicitly by putting two semicolons between the "OSC 8"...
operation code, if you will...and the URL. This follows Egmont's
examples and, to me, it seemed the clearest thing to do from the
perspective of someone analyzing or debugging this output. It should
also make an interpreter easier to write. If you see two semicolons in
sequence after OSC 8, what follows them _must_ be a URL.

We can alternatively go behind the man(7) macro package's back and
specify OSC 8 contents, including key/value pairs, directly.

    $ printf 'Of course it runs \\X"tty: link https://www.gnu.org/software/emacs foo=bar baz=qux"Emacs\\X"tty: link".\n.pl \\n[nl]u\n' | ~/groff-HEAD/bin/nroff | od -c
    0000000   O   f       c   o   u   r   s   e       i   t       r   u   n
    0000020   s     033   ]   8   ;   :   f   o   o   =   b   a   r   :   b
    0000040   a   z   =   q   u   x   ;   h   t   t   p   s   :   /   /   w
    0000060   w   w   .   g   n   u   .   o   r   g   /   s   o   f   t   w
    0000100   a   r   e   /   e   m   a   c   s 033   \   E   m   a   c   s
    0000120 033   ]   8   ;   ; 033   \   .  \n
    0000131

That first colon need not be in there, but I seem to recall causing
grotty to generate it on purpose to advertise the presence of a
key/value pair "param" so that lookahead would not be necessary in an
interpreter. Egmont's spec doesn't rule out empty "params", and this
example thus arguably starts with one. The meaning of an empty "param"
is not defined.[3] This shouldn't harm anything, and the URL still
works fine with gnome-terminal.

> (Pointing to some existing documentation of these sequences is also
> okay, provided that this documentation clearly explains the results as
> produced by Groff, and the intended meaning of each sequence.)

Regards,
Branden

[1] https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/devices/grotty/tty.cpp?h=1.24.0.rc2#n93
    https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/devices/grotty/tty.cpp?h=1.24.0.rc2#n473

[2] grotty produces "hard tabs" (TAB) only if given the `-h`
    command-line option, and form feeds (FF) only if given the `-f`
    option, and each of these only as circumstances warrant. See its
    man page.

[3] Nor is a "param" lacking an equals sign. Nor is a mechanism for
    quoting an equals sign, colon, or semicolon.
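
The interpretation steps and the two dumps above can be condensed into a
short sketch. It is written in Python rather than Emacs Lisp, and it is
an illustration under stated assumptions, not grotty's code or a
proposed Emacs implementation. The names OSC8_RE, parse_params, and
split_links are hypothetical. The sketch uses a regular expression in
place of the character queue described in the steps; it assumes exactly
one "params" field before the URL, as in Egmont's spec, so semicolons
inside the URL are kept; it does not accept BEL as a terminator; it
skips empty "params" and those lacking an equals sign; it ignores an
opener seen while already within link text; and it leaves any SGR
sequences untouched in the returned text.

```python
import re

# ESC ] 8 ; params ; URI ESC \  -- BEL termination is deliberately not
# accepted.  Assumption: one params field, then everything up to ST is
# the URI.
OSC8_RE = re.compile(r"\x1b\]8;([^;\x1b]*);([^\x1b]*)\x1b\\")


def parse_params(field):
    """Turn a colon-separated params field into a dict.

    Empty params and params without '=' have no defined meaning, so
    this sketch simply drops them.
    """
    params = {}
    for item in field.split(":"):
        key, sep, value = item.partition("=")
        if sep and key:
            params[key] = value
    return params


def split_links(text):
    """Return a list of (chunk, url, params) tuples.

    url is None for text outside a hyperlink; otherwise chunk is link
    text and url is the target given by the opening OSC 8 sequence.
    """
    segments = []
    url, params = None, {}
    pos = 0
    for m in OSC8_RE.finditer(text):
        chunk = text[pos:m.start()]
        if chunk:
            segments.append((chunk, url, params))
        new_url = m.group(2)
        if new_url:
            # Nesting is undefined; ignore openers inside link text.
            if url is None:
                url, params = new_url, parse_params(m.group(1))
        else:
            # An empty URI concludes the link text.
            url, params = None, {}
        pos = m.end()
    if text[pos:]:
        segments.append((text[pos:], url, params))
    return segments


# The byte stream from the second od(1) dump above.
sample = ("Of course it runs \x1b]8;:foo=bar:baz=qux;"
          "https://www.gnu.org/software/emacs\x1b\\Emacs\x1b]8;;\x1b\\.\n")
for chunk, url, params in split_links(sample):
    print(repr(chunk), url, params)
```

Run on that stream, the loop reports three chunks: the leading text with
no URL, "Emacs" with the URL and both key/value pairs, and the trailing
period and newline with no URL. The leading empty "param" produced by
the first colon is dropped without complaint.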
