Re: Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-28 Thread Zachary Santer
> On Sun, 21 Apr 2024, wrotycz wrote:
> >
> > It seems that it's 'interleaved' when buffer is written to a file or
> > pipe, and because stdout is buffered it waits until buffer is full or
> > flushed, while stderr is not and it doesn't wait and write immediately.
>
> Right; my point was just that stdout and stderr are still separate streams
> (with distinct buffers & buffering modes), even if fd 1 & 2 refer to the
> same pipe.

As I guess I should've expected, the behavior differs between a bash
script and a compiled program.

$ cat ./abc123
#!/bin/bash

printf '%s' 'a' >&2
printf '%s' '1'
printf '%s' 'b' >&2
printf '%s' '2'
printf '%s' 'c' >&2
printf '%s' '3'
printf '\n' >&2
printf '\n'

exit 0;
$ ./abc123
a1b2c3

$ ./abc123 2>&1 | cat
a1b2c3

$ cat ./abc123.c
#include <stdio.h>

int main()
{
  putc('a', stderr);
  putc('1', stdout);
  putc('b', stderr);
  putc('2', stdout);
  putc('c', stderr);
  putc('3', stdout);
  putc('\n', stderr);
  putc('\n', stdout);

  return 0;
}
$ gcc -o abc123.exe abc123.c
$ ./abc123.exe
a1b2c3

$ ./abc123.exe 2>&1 | cat
123
abc
$ stdbuf --output=0 --error=0 -- ./abc123.exe 2>&1 | cat
123
abc
$

I probably shouldn't go around assuming that things are smart. I'll
accept that adding logic to glibc to test whether any given set of
file descriptors point to the same file or pipe, and to ensure that
anything written to any one of them actually goes out through the
first one's stream, would probably be overkill.
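For what it's worth, the "same file or pipe" test itself would be cheap: fstat(2) already exposes the (device, inode) pair that identifies an open file. A minimal sketch of just that check (the helper name same_file is mine, not a glibc interface; the hard part glibc would still face is merging the buffers afterwards):

```c
#include <assert.h>     /* for the usage checks */
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not a glibc interface): do two file descriptors
 * refer to the same open file or pipe?  fstat(2) exposes the
 * (device, inode) pair that uniquely identifies one, so the test is
 * just a comparison of those fields. */
static int same_file(int fd_a, int fd_b)
{
    struct stat sa, sb;
    if (fstat(fd_a, &sa) != 0 || fstat(fd_b, &sb) != 0)
        return -1;                  /* at least one descriptor is bad */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Two descriptors produced by dup(2) (or by 2>&1 in the shell) compare equal; descriptors for two distinct pipes do not.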



Re: Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-20 Thread Zachary Santer
On Sat, Apr 20, 2024 at 11:58 AM Carl Edquist  wrote:
>
> On Thu, 18 Apr 2024, Zachary Santer wrote:
> >
> > Finally had a chance to try to build with 'stdbuf --output=L --error=L
> > --' in front of the build script, and it caused some crazy problems.
>
> For what it's worth, when I was trying that out on msys2 (since that's what
> you said you were using), I also ran into some very weird errors when just
> trying to export LD_PRELOAD and _STDBUF_O to what stdbuf -oL sets.  It was
> weird because I didn't see issues when just running a command (including
> bash) directly under stdbuf.  I didn't get to the bottom of it though and
> I don't have access to a windows laptop any more to experiment.

This was actually in RHEL 7.

stdbuf --output=L --error=L -- "${@}" 2>&1 |
  tee log-file |
  while IFS='' read -r line; do
    # do stuff
  done

And then obviously the arguments to this script give the command I
want it to run.

> Also I might ask, why are you setting "--error=L" ?
>
> Not that this is the problem you're seeing, but in any case stderr is
> unbuffered by default, and you might mess up the output a bit by line
> buffering it, if it's expecting to output partial lines for progress or
> whatever.

I don't know how buffering works when stdout and stderr get redirected
to the same pipe. You'd think, whatever it is, it would have to be
smart enough to keep them interleaved in the order they were printed.
With that in mind, I would assume they both get placed into the same
block buffer by default.
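As the abc123 test higher up in this thread shows, libc actually keeps the two streams' buffers fully independent even when fds 1 and 2 share a pipe. A program that wants write-order interleaving through a pipe can arrange it itself with nothing but setvbuf(3). A sketch, where the helper name is mine and this must run before any output:

```c
#include <assert.h>     /* for the usage check */
#include <stdio.h>
#include <unistd.h>

/* Sketch: force stdout to line buffering even when it is a pipe.
 * stderr is unbuffered by default, so after this call the a1b2c3-style
 * interleaving survives "2>&1 | cat".  Must be called before any
 * output; returns 0 on success, like setvbuf(3) itself. */
static int line_buffer_stdout(void)
{
    if (isatty(fileno(stdout)))
        return 0;                   /* terminal: already line-buffered */
    return setvbuf(stdout, NULL, _IOLBF, 0);
}
```

This is essentially what 'stdbuf --output=L' does from the outside, via its LD_PRELOAD shim, for programs that never call setvbuf themselves.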



Re: Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-19 Thread Zachary Santer
On Fri, Apr 19, 2024 at 8:26 AM Pádraig Brady  wrote:
>
> Perhaps at this stage we should consider stdbuf ubiquitous enough to suffice,
> noting that it's also supported on FreeBSD.

Alternatively, if glibc were modified to act on these hypothetical
environment variables, it would be trivial to have stdbuf simply set
those, to ensure backwards compatibility.

> I'm surprised that the LD_PRELOAD setting is breaking your ada build,
> and it would be interesting to determine the reason for that.

If I had that kind of time...



Re: Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-19 Thread Zachary Santer
On Fri, Apr 19, 2024 at 5:32 AM Pádraig Brady  wrote:
>
> env variables are what I proposed 18 years ago now:
> https://sourceware.org/bugzilla/show_bug.cgi?id=2457

And the "resistance to that" from the Red Hat people 18 years ago is
listed on a website that doesn't exist anymore.

If I'm to argue with a guy from 18 years ago...

Ulrich Drepper wrote:
> Hell, no.  Programs expect a certain buffer mode and perhaps would work
> unexpectedly if this changes.  By setting a mode to unbuffered, for instance,
> you can easily DoS a system.  I can think about enough other reasons why this
> is a terrible idea.  Programs explicitly must request a buffering scheme so
> that it matches the way the program uses the stream.

If buffering were set according to the env vars before the program
configures buffering on its end (if it chooses to), then the env vars
have no effect on any program that sets its own mode. This is how the
stdbuf util works, right now. Wouldn't programs that expect a certain
buffer mode set that mode explicitly themselves? Are you allowing
untrusted users to set env vars for important daemons or something?
How is this a valid concern?

This is specific to the standard streams, 0-2. Buffering of stdout and
stderr is already configured dynamically by libc. If it's going to a
terminal, it's line-buffered. If it's not, it's fully buffered.
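A sketch of what honoring such a variable could look like at program (or libc) startup. The function name, return convention, and the generic variable name are mine, borrowed loosely from the _STDBUF_O convention that stdbuf(1) already uses with its LD_PRELOAD shim; this is an illustration of the proposal, not an existing interface:

```c
#include <assert.h>     /* for the usage checks */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: honor a _STDBUF_O-style environment variable before
 * the program gets a chance to call setvbuf(3) itself, so an explicit
 * setvbuf later still wins -- the same precedence stdbuf(1) gives
 * today.  "L" means line-buffered, "0" means unbuffered, and a number
 * means fully buffered with that buffer size.
 * Returns -1 if the variable is unset, else setvbuf's result. */
static int apply_stdbuf_env(const char *var, FILE *stream)
{
    const char *val = getenv(var);
    if (val == NULL || *val == '\0')
        return -1;                      /* unset: keep libc defaults */
    if (strcmp(val, "L") == 0)
        return setvbuf(stream, NULL, _IOLBF, 0);
    if (strcmp(val, "0") == 0)
        return setvbuf(stream, NULL, _IONBF, 0);
    return setvbuf(stream, NULL, _IOFBF, (size_t)strtoul(val, NULL, 10));
}
```

Run for each of streams 0-2 before main's first I/O, this gives exactly the "env vars lose to an explicit setvbuf" behavior described above.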



Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-18 Thread Zachary Santer
Was "RFE: enable buffering on null-terminated data"

On Wed, Mar 20, 2024 at 4:54 AM Carl Edquist  wrote:
>
> However, if stdbuf's magic env vars are exported in your shell (either by
> doing a trick like 'export $(env -i stdbuf -oL env)', or else more simply
> by first starting a new shell with 'stdbuf -oL bash'), then every command
> in your pipelines will start with the new default line-buffered stdout.
> That way your line-items from build.sh should get passed all the way
> through the pipeline as they are produced.

Finally had a chance to try to build with 'stdbuf --output=L --error=L
--' in front of the build script, and it caused some crazy problems. I
was building Ada, though, so pretty good chance that part of the build
chain doesn't link against libc at all.

I got a bunch of
ERROR: ld.so: object '/usr/libexec/coreutils/libstdbuf.so' from
LD_PRELOAD cannot be preloaded: ignored.

And then it somehow caused compiler errors relating to the size of
what would be pointer types. Cleared out all the build products and
tried again without stdbuf and everything was fine.

From the original thread just within the coreutils email list, "stdbuf
feature request - line buffering but for null-terminated data":
On Tue, Mar 12, 2024 at 12:42 PM Kaz Kylheku  wrote:
>
> I would say that if it is implemented, the programs which require
> it should all make provisions to set it up themselves.
>
> stdbuf is a hack/workaround for programs that ignore the
> issue of buffering. Specifically, programs which send information
> to one of the three standard streams, such that the information
> is required in a timely way.  Those streams become fully buffered
> when not connected to a terminal.

I think I've partially come around to this point of view. However,
instead of expecting all sorts of individual programs to implement
their own buffering mode command-line options, could this be handled
with environment variables, but without LD_PRELOAD? I don't know if
libc itself can check for those environment variables and adjust each
program's buffering on its own, but if so, that would be a much
simpler solution.

You could compare this to the various locale environment variables,
though I think a lot of commands whose behavior differs from locale to
locale do have to implement their own handling of that internally, at
least to some extent.

This seems like somewhat less of a hack, and if no part of a program
looks for those environment variables, it isn't going to find itself
getting broken by the dynamic linker. It's just not going to change
its buffering.

Additionally, things that don't link against libc could still honor
these environment variables, if the developers behind them care to put
in the effort.

Zack



Re: RFE: enable buffering on null-terminated data

2024-03-19 Thread Zachary Santer
On Tue, Mar 19, 2024 at 1:24 AM Kaz Kylheku  wrote:
>
> But what tee does is set up _IONBF on its output streams,
> including stdout.

So it doesn't buffer at all. Awesome. Nevermind.



Re: RFE: enable buffering on null-terminated data

2024-03-17 Thread Zachary Santer
On Thu, Mar 14, 2024 at 11:14 AM Carl Edquist  wrote:

> Where things get sloppy is if you add some stuff in a pipeline after your
> build script, which results in things getting block-buffered along the
> way:
>
> $ ./build.sh | sed s/what/ever/ | tee build.log
>
> And there you will definitely see a difference.

Sadly, the man page for stdbuf specifically calls out tee as being
unaffected by stdbuf, because it adjusts the buffering of its standard
streams itself. The script I mentioned pipes everything through tee,
and I don't think I'm willing to refactor it not to. Ah well.

> Oh, I imagine "undefined operation" means something more like
> "unspecified" here.  stdbuf(1) uses setbuf(3), so the behavior you'll get
> should be whatever the setbuf(3) from the libc on your system does.
>
> I think all this means is that the C/POSIX standards are a bit loose about
> what is required of setbuf(3) when a buffer size is specified, and there
> is room in the standard for it to be interpreted as only a hint.

> Works for me (on glibc-2.23)

Thanks for setting me straight here.

> What may not be obvious is that the shell does not need to get involved
> with writing input for a coprocess or reading its output - the shell can
> start other (very fast) programs with input/output redirected to/from the
> coprocess pipes to do that processing.

Gosh, I'd like to see an example of that, too.

> My point though earlier was that a null-terminated record buffering mode,
> as useful as it sounds on the surface (for null-terminated paths), may
> actually be something _nobody_ has ever actually needed for an actual (not
> contrived) workflow.

I thought this seemed like something people could need years ago, and
only thought to email the lists about it last weekend. Maybe there are
all sorts of people out there who have been using 'stdbuf --output=0'
on null-terminated data for years and never thought to raise the
issue. I know that's not a very strong argument, though.



Re: stdbuf feature request - line buffering but for null-terminated data

2024-03-12 Thread Zachary Santer
On Tue, Mar 12, 2024 at 12:42 PM Kaz Kylheku  wrote:
> stdbuf is a hack/workaround for programs that ignore the
> issue of buffering. Specifically, programs which send information
> to one of the three standard streams, such that the information
> is required in a timely way.  Those streams become fully buffered
> when not connected to a terminal.

When we're talking about very simple programs, like expand, stdbuf is
probably the best solution we're ever going to actually get.

> There can be a performance issue also, though! Suppose
> we run "find" to find certain files over a large file tree.
> It finds only a small number of files: all the file paths
> identified fit into a single buffer, which is not flushed
> until the program terminates (when sent to a pipe).
>
> We pipe this to some program which does some processing
> on those files. We would like the processing to start as
> soon as the first file has been identified, not when find is done!
> It could be that find discovers all the relevant files
> early in its execution and then spends a minute finding
> nothing else. That minute is added to the processing time
> of the files that were found.
>
> That is the compelling reason for wanting file names to
> be flushed individually, whether they are newline terminated
> or null terminated.

An ideal solution for this situation, from the perspective of a
relative layperson, would be to flush a sized buffer once it has held
data for some maximum period without being flushed. So, if a buffer
fills quickly, it gets flushed upon being filled; if data sits in the
buffer for a few too many processor cycles or what have you, it gets
flushed right then. I imagine there would be some overhead to
implementing that, which I don't have a good feel for.
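In user space the idea can be approximated today with a periodic flush, no libc changes needed. A rough sketch using SIGALRM (the names are mine; note that fflush(3) is not async-signal-safe, so a real implementation would use a helper thread or a timerfd loop instead of a signal handler):

```c
#include <assert.h>     /* for the usage check */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of "flush if data has been sitting too long": flush stdout on
 * a fixed interval, alongside the normal full-buffer flushes.  A real
 * version would only wake when the buffer is non-empty; this
 * illustration just flushes unconditionally. */
static void flush_on_alarm(int sig)
{
    (void)sig;
    fflush(stdout);     /* harmless if the buffer is empty */
    alarm(1);           /* re-arm for the next interval */
}

static int start_flush_timer(unsigned interval_s)
{
    if (signal(SIGALRM, flush_on_alarm) == SIG_ERR)
        return -1;
    alarm(interval_s);
    return 0;
}
```

The overhead is one wakeup per interval plus possibly short write(2)s, which is the trade-off the paragraph above is gesturing at.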



Re: stdbuf feature request - line buffering but for null-terminated data

2024-03-12 Thread Zachary Santer
On Tue, Mar 12, 2024 at 2:58 PM Kaz Kylheku  wrote:
> What if there existed an alternative delimiting mode: a format where
> the character strings are delimited by the two byte sequence \0\n.

How long did it take for the major command-line utilities to initially
implement handling null-terminated data? I submitted a feature request
to the pcre2 maintainer to implement printing null-terminated
filenames from pcre2grep, just back in July of 2022. To his credit,
that got done quickly, but that version of the library still missed
getting into RHEL 9, unless they've updated it since I've looked.
Furthermore, there's little consistency from utility to utility in
what the flag to specify null-delimited data is. Now you're asking for
a whole lot more of that.

> 1. It now works with line buffering.
>
> 2. Assuming \0 is just an invisible character consumed by terminals with no
>effect, this format can be dumped to a TTY where it turns into
>lines, as if the nulls were not there.

"tr '\0' '\n'" at the end of a pipeline isn't the end of the world.

> 3. Portability: doesn't require a new buffering mode that would only
>be initially supported in Glibc, and likely never spread beyond
>a handful of freeware C libraries.

I've got a conversation going with the glibc people, that this list is
cc'd on, but who knows if it goes anywhere. In any case, if it's a
choice between unbuffered stdout and a whole new data delimiting
sequence that now every utility has to support, unbuffered stdout is
going to be the answer.



Re: RFE: enable buffering on null-terminated data

2024-03-11 Thread Zachary Santer
On Mon, Mar 11, 2024 at 7:54 AM Carl Edquist  wrote:
>
> (In my coprocess management library, I effectively run every coproc with
> --output=L by default, by eval'ing the output of 'env -i stdbuf -oL env',
> because most of the time for a coprocess, that's whats wanted/necessary.)

Surrounded by 'set -a' and 'set +a', I guess? Now that's interesting.
I just added that to a script I have that prints lines output by
another command that it runs, generally a build script, to the command
line, but updating the same line over and over again. I want to see if
it updates more continuously like that.

> ... Although, for your example coprocess use, where the shell both
> produces the input for the coproc and consumes its output, you might be
> able to simplify things by making the producer and consumer separate
> processes.  Then you could do a simpler 'producer | filter | consumer'
> without having to worry about buffering at all.  But if the producer and
> consumer need to be in the same process (eg they share state and are
> logically interdependent), then yeah that's where you need a coprocess for
> the filter.

Yeah, there's really no way to break what I'm doing into a standard pipeline.

> (Although given your time output, you might say the performance hit for
> unbuffered is not that huge.)

We see a somewhat bigger difference, at least proportionally, if we
get bash more or less out of the way. See command-buffering, attached.

Standard:
real    0m0.202s
user    0m0.280s
sys     0m0.076s
Line-buffered:
real    0m0.497s
user    0m0.374s
sys     0m0.545s
Unbuffered:
real    0m0.648s
user    0m0.544s
sys     0m0.702s

In coproc-buffering, unbuffered output was 21.7% slower than
line-buffered output, whereas here it's 30.4% slower.

Of course, using line-buffered or unbuffered output in this situation
makes no sense. Where it might be useful in a pipeline is when an
earlier command in a pipeline might only print things occasionally,
and you want those things transformed and printed to the command line
immediately.

> So ... again in theory I also feel like a null-terminated buffering mode
> for stdbuf(1) (and setbuf(3)) is kind of a missing feature.

My assumption is that line-buffering through setbuf(3) was implemented
for printing to the command line, so its availability to stdbuf(1) is
just a useful side effect.

In the BUGS section in the man page for stdbuf(1), we see:
On GLIBC platforms, specifying a buffer size, i.e., using fully
buffered mode will result in undefined operation.

If I'm not mistaken, then buffer modes other than 0 and L don't
actually work. Maybe I should count my blessings here. I don't know
what's going on in the background that would explain glibc not
supporting any of that, or stdbuf(1) implementing features that aren't
supported on the vast majority of systems where it will be installed.

> It may just
> be that nobody has actually had a real need for it.  (Yet?)

I imagine if anybody has, they just set --output=0 and moved on. Bash
scripts aren't the fastest thing in the world, anyway.


command-buffering
Description: Binary data


Re: RFE: enable buffering on null-terminated data

2024-03-10 Thread Zachary Santer
On Sun, Mar 10, 2024 at 4:36 PM Carl Edquist  wrote:
>
> Hi Zack,
>
> This sounds like a potentially useful feature (it'd probably belong with a
> corresponding new buffer mode in setbuf(3)) ...
>
> > Filenames should be passed between utilities in a null-terminated
> > fashion, because the null byte is the only byte that can't appear within
> > one.
>
> Out of curiosity, do you have an example command line for your use case?

My use for 'stdbuf --output=L' is to be able to run a command within a
bash coprocess. (Really, a background process communicating with the
parent process through FIFOs, since Bash prints a warning message if
you try to run more than one coprocess at a time. Shouldn't make a
difference here.) See coproc-buffering, attached. Without making the
command's output either line-buffered or unbuffered, what I'm doing
there would deadlock. I feed one line in and then expect to be able to
read a transformed line immediately. If that transformed line is stuck
in a buffer that's still waiting to be filled, then nothing happens.

I swear doing this actually makes sense in my application.

$ ./coproc-buffering 10
Line-buffered:
real    0m17.795s
user    0m6.234s
sys     0m11.469s
Unbuffered:
real    0m21.656s
user    0m6.609s
sys     0m14.906s

When I initially implemented this thing, I felt lucky that the data I
was passing in were lines ending in newlines, and not null-terminated,
since my script gets to benefit from 'stdbuf --output=L'. Truth be
told, I don't currently have a need for --output=N. Of course, sed and
all sorts of other Linux command-line tools can produce or handle
null-terminated data.

> > If I want to buffer output data on null bytes, the closest I can get is
> > 'stdbuf --output=0', which doesn't buffer at all. This is pretty
> > inefficient.
>
> I'm just thinking that find(1), for instance, will end up calling write(2)
> exactly once per filename (-print or -print0) if run under stdbuf
> unbuffered, which is the same as you'd get with a corresponding stdbuf
> line-buffered mode (newline or null-terminated).
>
> It seems that where line buffering improves performance over unbuffered is
> when there are several calls to (for example) printf(3) in constructing a
> single line.  find(1), and some filters like grep(1), will write a line at
> a time in unbuffered mode, and thus don't seem to benefit at all from line
> buffering.  On the other hand, cut(1) appears to putchar(3) a byte at a
> time, which in unbuffered mode will (like you say) be pretty inefficient.
>
> So, depending on your use case, a new null-terminated line buffered option
> may or may not actually improve efficiency over unbuffered mode.

I hadn't considered that.

> You can run your commands under strace like
>
>  stdbuf --output=X  strace -c -ewrite  command ... | ...
>
> to count the number of actual writes for each buffering mode.

I'm running bash in MSYS2 on a Windows machine, so hopefully that
doesn't invalidate any assumptions. Now setting up strace around the
things within the coprocess, and only passing in one line, I now have
coproc-buffering-strace, attached. Giving the argument 'L', both sed
and expand call write() once. Giving the argument 0, sed calls write()
twice and expand calls it a bunch of times, seemingly once for each
character it outputs. So I guess that's it.

$ ./coproc-buffering-strace L
|Line with tabs   why?|

$ grep -c -F 'write:' sed-trace.txt expand-trace.txt
sed-trace.txt:1
expand-trace.txt:1

$ ./coproc-buffering-strace 0
|Line with tabs   why?|

$ grep -c -F 'write:' sed-trace.txt expand-trace.txt
sed-trace.txt:2
expand-trace.txt:30

> Carl
>
>
> PS, "find -printf" recognizes a '\c' escape to flush the output, in case
> that helps.  So "find -printf '%p\0\c'" would, for instance, already
> behave the same as "stdbuf --output=N  find -print0" with the new stdbuf
> output mode you're suggesting.
>
> (Though again, this doesn't actually seem to be any more efficient than
> running "stdbuf --output=0  find -print0")
>
> On Sun, 10 Mar 2024, Zachary Santer wrote:
>
> > Was "stdbuf feature request - line buffering but for null-terminated data"
> >
> > See below.
> >
> > On Sun, Mar 10, 2024 at 5:38 AM Pádraig Brady  wrote:
> >>
> >> On 09/03/2024 16:30, Zachary Santer wrote:
> >>> 'stdbuf --output=L' will line-buffer the command's output stream.
> >>> Pretty useful, but that's looking for newlines. Filenames should be
> >>> passed between utilities in a null-terminated fashion, because the
> >>> null byte is the only byte that can't appear within one.
> >>>
> >>> If I want to buffer output data on null bytes, the closest I can get

RFE: enable buffering on null-terminated data

2024-03-10 Thread Zachary Santer
Was "stdbuf feature request - line buffering but for null-terminated data"

See below.

On Sun, Mar 10, 2024 at 5:38 AM Pádraig Brady  wrote:
>
> On 09/03/2024 16:30, Zachary Santer wrote:
> > 'stdbuf --output=L' will line-buffer the command's output stream.
> > Pretty useful, but that's looking for newlines. Filenames should be
> > passed between utilities in a null-terminated fashion, because the
> > null byte is the only byte that can't appear within one.
> >
> > If I want to buffer output data on null bytes, the closest I can get
> > is 'stdbuf --output=0', which doesn't buffer at all. This is pretty
> > inefficient.
> >
> > 0 means unbuffered, and Z is already taken for, I guess, zebibytes.
> > --output=N, then?
> >
> > Would this require a change to libc implementations, or is it possible now?
>
> This does seem like useful functionality,
> but it would require support for libc implementations first.
>
> cheers,
> Pádraig



stdbuf feature request - line buffering but for null-terminated data

2024-03-09 Thread Zachary Santer
'stdbuf --output=L' will line-buffer the command's output stream.
Pretty useful, but that's looking for newlines. Filenames should be
passed between utilities in a null-terminated fashion, because the
null byte is the only byte that can't appear within one.

If I want to buffer output data on null bytes, the closest I can get
is 'stdbuf --output=0', which doesn't buffer at all. This is pretty
inefficient.

0 means unbuffered, and Z is already taken for, I guess, zebibytes.
--output=N, then?

Would this require a change to libc implementations, or is it possible now?

- Zack



Re: [PATCH] printf: add %#s alias to %b

2023-09-07 Thread Zachary Santer
On Thu, Sep 7, 2023 at 12:55 PM Robert Elz  wrote:

> There are none, printf(3) belongs to the C committee, and they can make
> use of anything they like, at any time they like.
>
> The best we can do is use formats that make no sense for printf(1) to
> support
>

That's still assuming the goal of minimizing the discrepancies between
printf(1) and printf(3) format specifiers. As you point out, that isn't
particularly useful, and these things diverging further is now a foregone
conclusion. The only benefit, from my perspective, is allowing the
printf(1) man page to simply reference the printf(3) man page for
everything that printf(1) attempts to replicate.

Zack


Re: [PATCH] printf: add %#s alias to %b

2023-09-07 Thread Zachary Santer
The trouble with using an option flag to printf(1) to toggle the meaning of
%b is that you can't then mix format specifiers for binary literals and
backslash escape expansion within the same format string. You'd just have
to call printf(1) multiple times, which largely defeats the purpose of a
format string.

I don't know what potential uppercase/lowercase pairs of format specifiers
are free from use in any existing POSIX-like shell, but my suggestion would
be settling on one of those to take on the meaning of C2x's %b. They'd
still print '0b' or '0B' in the resulting binary literal when given the #
flag, which might be a little confusing, but this seems like the safest way
to go. It obviously still represents a divergence from C2x's printf(3), but
I think the consensus is that that's going to happen regardless.

ksh's format specifiers for arbitrary-base integer representation sound
really slick, and I'd love to see that in Bash, too, actually.

Zack