m4 infers a newline in this case:

    echo -n dnl | m4
    m4:stdin:1: Warning: end of file treated as newline

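The condition behind that warning can be checked before m4 ever reads the file. What follows is only a minimal sketch, assuming a POSIX shell with tail and wc; the function name ends_with_newline is hypothetical and is not part of m4.

```shell
# Hypothetical helper (not part of m4): succeed when FILE is empty or its
# last byte is a newline, i.e. when the file has no unterminated last line.
ends_with_newline() {
    # An empty file has no incomplete last line.
    [ ! -s "$1" ] && return 0
    # tail -c 1 emits the last byte; wc -l counts newline characters,
    # so the count is 1 exactly when that byte is a newline.
    [ "$(tail -c 1 "$1" | wc -l)" -eq 1 ]
}
```

One possible use is a guard such as: ends_with_newline file1 || echo "file1: no final newline" >&2
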
It also objects to a file with an unterminated quote

    echo \` | m4 - <(echo \')
    m4:stdin:1: ERROR: end of file in string

and objects similarly to an incomplete argument list

    echo "define(x,y)x(" | m4 - <(echo \) )
    m4:file1:1: ERROR: end of file in argument list

> Would you like to submit a patch to lift the artificial limitation of
> splitting tokens at a file boundary?

Ideally a patch would make the results of these two commands agree:

    m4 file1 file2
    cat file1 file2 | m4

I suggest that two patches be made simultaneously: make m4 handle
multiple files as if they had been combined by cat, and make non-Posix
input an error.

Requiring Posix input would break the behavior of non-newline-terminated
files, but with a diagnostic that pinpoints the necessary adjustment of
the input. It should often be sufficient to append `'dnl\n. (Note the
quotes).

Requiring Posix input would also break the behavior of files that
contain NUL. Hopefully tr -d '\0' will cure most cases.

Finally, requiring Posix input could hide the ERROR diagnostics
mentioned above. But I don't think it's necessary to support the
portability of fatal errors.

Doug

On Sat, Dec 7, 2024 at 11:02 PM Eric Blake <ebl...@redhat.com> wrote:
>
> On Fri, Dec 06, 2024 at 05:33:09PM -0500, Douglas McIlroy wrote:
> > Arguments are collected if a macro name is "immediately" followed by a
> > left parenthesis. Experiment shows that arguments are not collected
> > when a macro name occurs at the end of a file (without a following
> > newline) and the next input file begins with a left parenthesis. I
> > believe this behavior is incorrect.
> >
>
> GNU m4 is not injecting a missing newline, so much as its input
> scanner refuses to recognize tokens that would extend across file
> boundaries. And, as you correctly noted, POSIX says the behavior is
> undefined if the input lacks an ending newline, so there we are free
> to make it do whatever is more useful.
>
> It's not just file boundaries: at least m4wrap() behaves the same way,
> by injecting an artificial end-of-token boundary between the first and
> second layers of nested unwrapping. Consider this example:
>
> echo 'changequote([,])define([ab],[AB])m4wrap(b)m4wrap(a)' | m4
>
> which outputs "\nAB", but:
>
> $ echo 'changequote([,])define([ab],[AB])m4wrap([m4wrap(b)a])' | m4
>
> changes the output to "\nab". What's going on? Under the hood, the
> output "a" from one m4wrap is concatenated with whatever text is next
> in the input stream from any other LIFO m4wrap()s; so the first
> example shows a and b adjacent after the two m4wrap'd text strings are
> replayed in reverse order turning into a macro expansion. But in the
> second example, even though there are no intervening characters in the
> output stream, there WAS an intervention - m4 reached the
> "end-of-file" of the first layer of m4wrap, and proceeded to dive into
> the second layer of m4wrap, which was a stronger boundary than two
> wraps at the same nesting layer.
>
> But one of the powers of m4 is the ability to create macro names by
> concatenation on rescan. It seems odd that end-of-file should
> interfere with what is normally m4's strong point of rescanning
> unquoted output to see if new macro calls appear.
>
> Would you like to submit a patch to lift the artificial limitation of
> splitting tokens at a file boundary? Should that be done
> unconditionally, or gated somehow (preferably with the default
> behavior as it has always been, where you have to opt in to the new
> behavior)? And if gated, would it be by a new command-line argument,
> or would it be something you can toggle on or off at will while m4 is
> running, or both? And does such concatenation work across frozen
> files? Should there be limits on what can be concatenated?
> You mention "macro" and "(args)" turning into a call of macro(), but
> would "def" and "ine(args)" turn into a call of define(args); or what
> about "changequote(<<,>>)<" and "<is this quoted>>"? Do you really want
> file B to behave differently because of whatever was left at the end
> of file A? Thus, my gut feel is that m4 is unlikely to change from
> its current behavior, because the design costs outweigh the
> corner-case benefits.
>
> > However, the underlying error lies not with m4, but with the input
> > file. According to POSIX, a valid nonempty text file must end with a
> > newline. Other experiment suggests that m4 silently appends a missing
> > newline. Should it not warn when it does so?
>
> First off, it _is_ possible to define a macro that warns if it is
> invoked with 0 arguments (ifelse on $# and errprint are handy). But
> that doesn't help you if you want to handle an arbitrary word, rather
> than a known macro name, at the end of the file. And that doesn't
> help with your desire to put the macro name in one file and its
> (arguments) in another.
>
> Meanwhile, it _is_ a feature of GNU m4 that if the input lacks the
> trailing newline, it will produce output that also lacks the trailing
> newline (POSIX doesn't say whether that makes any sense - it says
> portable use of m4 is limited to the input being a text file, but
> not whether the output should be a text file; but that means it is
> probably not portable to try to rely on that behavior). So if we
> started warning because of your situation, we might break others who
> have come to rely on that extension behavior.
> It's also possible that even when your m4 file ends in a newline, you
> ended it with " dnl\n" or had some use of m4wrap that does not itself
> produce a newline, so that the output produced is NOT a text file,
> according to POSIX; POSIX does not say whether that is well-defined
> behavior, but I suspect there may be some non-GNU m4 implementations
> that supply a trailing newline for a file when your m4 program
> doesn't, even though GNU m4 doesn't supply that newline.
>
> And lest you think that using m4 to produce output without a trailing
> newline is the only worry, there are other ways in which you can
> produce a non-text-file output: use of include, syscmd, or undivert to
> produce NUL bytes in the output stream (even though GNU m4 itself may
> not handle them gracefully), producing lines longer than the
> platform's line-buffer limits (although GNU systems try not to have
> such limits, there are other systems which do, and m4 recursion makes
> it very easy to write something that expands to a large length at the
> expense of some CPU churning), or encoding errors (GNU m4 is not yet
> upgraded to the Unicode world, so it will happily break bytes apart
> and mangle UTF-8 into mojibake if you are not careful).
>
> The rest of my mail is a bit of a tangent: I've long wished that I had
> a way to read in a file containing a blob of arbitrary text, and
> sanitize it (with translit or patsubst) for further safe processing in
> m4; my ideal way would be doing something like:
>
> $ tail -n2 A_head.m4
> dnl ...any other setup code above
> changequote(`````,''''')define(`data',`````
> $ head -n2 A_tail.m4
> ''''')changequote`'dnl now defn(`data') can access that blob as a
> dnl single argument of whatever file(s) are passed in the middle
> $ m4 A_head.m4 your_file_here A_tail.m4
>
> where you can pick whatever complicated changequote needed so that you
> don't have to worry about the contents of your_file_here inadvertently
> triggering m4 syntax.
> I can't quite do that with the include builtin,
> where the file you just parsed in _will_ be parsed as m4, rather than
> raw data. But GNU m4's current refusal to allow concatenation across
> file boundaries means that it errors out on an unterminated string
> instead of letting my theoretical trick work, just the same as it
> prevents you from continuing a macro name or its parameter list across
> the file boundary.
>
> You _can_ do tricks with changecom to do roughly the same: if you know
> what the first few bytes of the file are, you can set up a changecom
> that starts with those first few bytes and ends with a long sequence
> unlikely to be in the middle of the included file - at which point you
> can now treat the entire input file as a single m4 comment which
> undergoes no further expansion, then trim off your suffix sequence as
> you sanitize the data. But then you are at the issue of how you
> detect those first few bytes; perhaps syscmd or esyscmd can be used
> for that, although it's another set of interesting language barriers
> when you try to write m4 that produces valid shell code for grabbing
> untrusted bytes in a sanitized manner.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization: qemu.org | libguestfs.org
>