m4 infers a newline in this case:
    echo -n dnl | m4
    m4:stdin:1: Warning: end of file treated as newline

It also objects to a file with an unterminated quote
    echo \` | m4 - <(echo \')
    m4:stdin:1: ERROR: end of file in string

and objects similarly to an incomplete argument list
    echo "define(x,y)x("  | m4 - <(echo \) )
    m4:file1:1: ERROR: end of file in argument list

> Would you like to submit a patch to lift the artificial limitation of
> splitting tokens at a file boundary?

Ideally a patch would make the results of these two commands agree:
    m4 file1 file2
    cat file1 file2 | m4
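
A minimal sketch of one way to check that agreement, assuming a POSIX
shell with mktemp and an m4 on PATH. The macro f and the file contents
are made up for illustration:

```shell
# Hypothetical inputs: file1 ends with a macro name and no newline,
# file2 begins with the left parenthesis of an argument list.
dir=$(mktemp -d)
printf '%s' "define(\`f',\`[\$1]')f" > "$dir/file1"
printf '(a)\n' > "$dir/file2"

# Run both forms only if an m4 is installed (assumption: any m4 on PATH).
if command -v m4 >/dev/null 2>&1; then
    (cd "$dir" && m4 file1 file2)        # files given separately
    (cd "$dir" && cat file1 file2 | m4)  # files concatenated first
fi
```

Per the report quoted below, current GNU m4 is expected not to collect
(a) as the argument of f in the first form, whereas the concatenated
input presents f(a) as a single call.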

I suggest that two patches be made simultaneously: make m4
handle multiple files as if they had been combined by cat, and make
non-POSIX input an error.

Requiring POSIX input would break the behavior of non-newline-terminated
files, but with a diagnostic that pinpoints the necessary adjustment of
the input. It should often be sufficient to append `'dnl\n (note the quotes).

Requiring POSIX input would also break the behavior of files that contain
NUL. Hopefully tr -d '\0' will cure most cases.
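
The two adjustments can be sketched together as a shell filter. This is
only an illustration under stated assumptions: the function names
lacks_final_newline and posixify are hypothetical, and a POSIX shell
with mktemp is assumed.

```shell
# Hypothetical helper: succeed if the file is non-empty and its last
# byte is not a newline.
lacks_final_newline() {
    [ -s "$1" ] && [ "$(tail -c 1 "$1" | wc -l | tr -d ' ')" -eq 0 ]
}

# Hypothetical filter: write a conditioned copy of file $1 to stdout.
posixify() {
    tmp=$(mktemp) || return
    # Delete NUL bytes, as suggested above.
    tr -d '\0' < "$1" > "$tmp"
    cat "$tmp"
    # If the final newline is missing, append `'dnl and a newline.
    if lacks_final_newline "$tmp"; then
        printf '%s\n' "\`'dnl"
    fi
    rm -f "$tmp"
}
```

It might be used as: posixify input.m4 | m4, where input.m4 stands for
whatever file needs conditioning.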

Finally, requiring POSIX input could hide the ERROR diagnostics mentioned
above. But I don't think it's necessary to support the portability of
fatal errors.

Doug





On Sat, Dec 7, 2024 at 11:02 PM Eric Blake <ebl...@redhat.com> wrote:
>
> On Fri, Dec 06, 2024 at 05:33:09PM -0500, Douglas McIlroy wrote:
> > Arguments are collected if a macro name is "immediately" followed by a
> > left parenthesis. Experiment shows that arguments are not collected
> > when a macro name occurs at the end of a file (without a following
> > newline) and the next input file begins with a left parenthesis. I
> > believe this behavior is incorrect.
> >
>
> GNU m4 is not injecting a missing newline, so much as its input
> scanner refuses to recognize tokens that would extend across file
> boundaries.  And, as you correctly noted, POSIX says the behavior is
> undefined if the input lacks an ending newline, so there we are free
> to make it do whatever is more useful.
>
> It's not just file boundaries: at least m4wrap() behaves the same way,
> by injecting an artificial end-of-token boundary between the first and
> second layers of nested unwrapping. Consider this example:
>
> $ echo 'changequote([,])define([ab],[AB])m4wrap(b)m4wrap(a)' | m4
>
> which outputs "\nAB", but:
>
> $ echo 'changequote([,])define([ab],[AB])m4wrap([m4wrap(b)a])' | m4
>
> changes the output to "\nab".  What's going on?  Under the hood, the
> output "a" from one m4wrap is concatenated with whatever text is next
> in the input stream from any other LIFO m4wrap()s; so the first
> example shows a and b adjacent after the two m4wrap'd text strings are
> replayed in reverse order turning into a macro expansion.  But in the
> second example, even though there are no intervening characters in the
> output stream, there WAS an intervention - m4 reached the
> "end-of-file" of the first layer of m4wrap, and proceeded to dive into
> the second layer of m4wrap, which was a stronger boundary than two
> wraps at the same nesting layer.
>
> But one of the powers of m4 is the ability to create macro names by
> concatenation on rescan.  It seems odd that end-of-file should
> interfere with what is normally m4's strong point of rescanning
> unquoted output to see if new macro calls appear.
>
> Would you like to submit a patch to lift the artificial limitation of
> splitting tokens at a file boundary?  Should that be done
> unconditionally, or gated somehow (preferably with the default
> behavior as it has always been, where you have to opt in to the new
> behavior)?  And if gated, would it be by a new command-line argument,
> or would it be something you can toggle on or off at will while m4 is
> running, or both?  And does such concatenation work across frozen
> files?  Should there be limits on what can be concatenated?  You
> mention "macro" and "(args)" turning into a call of macro(), but would
> "def" and "ine(args)" turn into a call of define(args); or what about
> "changequote(<<,>>)<" and "<is this quoted>>"?  Do you really want
> file B to behave differently because of whatever was left at the end
> of file A?  Thus, my gut feel is that m4 is unlikely to change from
> its current behavior, because the design costs outweigh the
> corner-case benefits.
>
> > However, the underlying error lies not with m4, but with the input
> > file. According to POSIX, a valid nonempty text file must end with a
> > newline. Other experiment suggests that m4 silently appends a missing
> > newline. Should it not warn when it does so?
>
> First off, it _is_ possible to define a macro that warns if it is
> invoked with 0 arguments (ifelse on $# and errprint are handy).  But
> that doesn't help you if you want to handle an arbitrary word, rather
> than a known macro name, at the end of the file.  And that doesn't
> help with your desire to put the macro name in one file and its
> (arguments) in another.
>
> Meanwhile, it _is_ a feature of GNU m4 that if the input lacks the
> trailing newline, it will produce output that also lacks the trailing
> newline (POSIX doesn't say whether that makes any sense - it says
> portable use of m4 is limited to the input being a text file, but
> not whether the output should be a text file; but that means it is
> probably not portable to try to rely on that behavior).  So if we
> started warning because of your situation, we might break others who
> have come to rely on that extension behavior.  It's also possible that
> even when your m4 file ends in a newline, you ended it with " dnl\n"
> or had some use of m4wrap that does not itself produce a newline, so
> that the output produced is NOT a text file, according to POSIX; POSIX
> does not say whether that is well-defined behavior, but I suspect
> there may be some non-GNU m4 implementations that supply a trailing
> newline for a file when your m4 program doesn't, even though GNU m4
> doesn't supply that newline.
>
> And lest you think that using m4 to produce output without a trailing
> newline is the only worry, there are other ways in which you can
> produce a non-text-file output: use of include, syscmd, or undivert to
> produce NUL bytes in the output stream (even though GNU m4 itself may
> not handle them gracefully), producing lines longer than the platform's
> line-buffer limits (although GNU systems try not to have such limits,
> there are other systems which do, and m4 recursion makes it very easy
> to write something that expands to a large length at the expense of
> some CPU churning), or encoding errors (GNU m4 is not yet upgraded to
> the Unicode world, so it will happily break bytes apart and mangle
> UTF-8 into mojibake if you are not careful).
>
> The rest of my mail is a bit of a tangent: I've long wished that I had
> a way to read in a file containing a blob of arbitrary text, and
> sanitize it (with translit or patsubst) for further safe processing in
> m4; my ideal way would be doing something like:
>
> $ tail -n2 A_head.m4
> dnl ...any other setup code above
> changequote(`````,''''')define(`data',`````
> $ head -n2 A_tail.m4
> ''''')changequote`'dnl now defn(`data') can access that blob as a
> dnl single argument of whatever file(s) are passed in the middle
> $ m4 A_head.m4 your_file_here A_tail.m4
>
> where you can pick whatever complicated changequote needed so that you
> don't have to worry about the contents of your_file_here inadvertently
> triggering m4 syntax.  I can't quite do that with the include builtin,
> where the file you just parsed in _will_ be parsed as m4, rather than
> raw data.  But GNU m4's current refusal to allow concatenation across
> file boundaries means that it errors out on an unterminated string
> instead of letting my theoretical trick work, just the same as it
> prevents you from continuing a macro name or its parameter list across
> the file boundary.
>
> You _can_ do tricks with changecom to do roughly the same: if you know
> what the first few bytes of the file are, you can set up a changecom
> that starts with those first few bytes and ends with a long sequence
> unlikely to be in the middle of the included file - at which point you
> can now treat the entire input file as a single m4 comment which
> undergoes no further expansion, then trim off your suffix sequence as
> you sanitize the data.  But then you are at the issue of how you
> detect those first few bytes; perhaps syscmd or esyscmd can be used
> for that, although it's another set of interesting language barriers
> when you try to write m4 that produces valid shell code for grabbing
> untrusted bytes in a sanitized manner.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization:  qemu.org | libguestfs.org
>
