On Fri, Apr 16, 2021 at 02:26:27AM -0700, Jordan Geoghegan wrote:

> On 4/15/21 7:49 AM, Otto Moerbeek wrote:
> > On Thu, Apr 15, 2021 at 04:29:17PM +0200, Christian Weisgerber wrote:
> >
> >> Jordan Geoghegan:
> >>
> >>> --- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
> >>> +++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
> >> I'll note that no characters have been lost between the two files.
> >> Only the order is different.
> >>
> >>> The only thing that changed between these runs was me using either xargs 
> >>> -P 1 or -P 2.
> >> What do you expect?  You run two processes in parallel that write
> >> to the same file.  Obviously their output will be interspersed in
> >> unpredictable order.
> >>
> >> You seem to imagine that awk's output is line-buffered.  But when
> >> it writes to a pipe or file, its output is block-buffered.  This
> >> is default stdio behavior.  Output is written in block-size increments
> >> (16 kB in practice) without regard to lines.  So, yes, you can end
> >> up with a fragment from a line written by process #1, followed by
> >> lines from process #2, followed by the remainder of the line from
> >> #1, etc.
> >>
> >> -- 
> >> Christian "naddy" Weisgerber                          [email protected]
> >>
> > Right, a fflush() call after the printf makes the issue go away, but
> > only because awk is being nice and issues a single write call for
> > that single printf. Since awk afaik does not give such a guarantee,
> > it is better to have each parallel invocation write to a separate
> > file and then cat them together after all the awk runs are done.
> >
> >     -Otto
> 
> Hello Christian and Otto,
> 
> Thank you for setting me straight. The block vs line buffering issue
> should have been obvious to me. What confused me was that this solution
> worked well for a long time, until it didn't. One would assume that it
> would consistently mangle the output...

Buffering issues depend on (the size of) the data being written. I
think it is pretty consistent: if the bug appears, it always does so
in the same way.
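
For the record, a minimal sketch of that fflush() fix (the file names
and fields are made up for illustration; this relies on awk issuing a
single write per flushed record, which, as noted above, is not
guaranteed):

    # flush stdout after every record so each printf leaves awk's
    # stdio buffer right away instead of sitting in a 16 kB block
    awk '{ printf "%s %s\n", $1, $2; fflush() }' input.txt >> shared.out

Even with that, separate per-process output files remain the robust
approach.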

> 
> While fflush does seem to fix the issue, I wanted to explore your suggestion 
> Otto of writing to a temporary file from within awk.
> 
> Is something like the following a sane approach to safely generating
> temporary files from within awk?
> 
> BEGIN {
>     cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX"
>     if ((cmd | getline result) > 0) TMPFILE = result; else exit 1
> }
> 
> Unless I'm missing something obvious, it seems there is no way to
> capture both the stdout and the return code of an external command
> from within awk. My workaround for error-checking the call to mktemp
> here is to abort if mktemp returns no data. Is this sane?
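
On capturing the exit status: in many awk implementations, close() on
a command pipe returns the status obtained from pclose(), so the exit
code can be checked after the output has been read. The exact encoding
of that value differs between awks, so treat it as an extra sanity
check rather than a portable guarantee. A sketch building on your
snippet:

    BEGIN {
        cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX"
        # abort if mktemp produced no output
        if ((cmd | getline result) <= 0)
            exit 1
        # in many awks close() on a pipe returns the command's
        # wait status; non-zero here means mktemp failed
        if (close(cmd) != 0)
            exit 1
        TMPFILE = result
    }
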
> 
> Regards,
> 
> Jordan

I think that would work, but maybe it is nicer to wrap the code in a
shell script that generates the tmp file names, passes them to awk,
and then does the catting of the result files afterwards? To run the
cat command you need to know the names of the files anyway.
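
An untested sketch of that idea, with made-up file names ("input.*",
"result.txt") and a trivial awk body standing in for the real script:

    #!/bin/sh
    # run the awk jobs in parallel, each writing to a private tmp
    # file, then cat the pieces together in a deterministic order

    workdir=$(mktemp -d /tmp/workdir.XXXXXXXXXX) || exit 1

    outs=""
    for f in input.*; do
        out=$(mktemp "$workdir/part.XXXXXXXXXX") || exit 1
        outs="$outs $out"
        # the private tmp file name is handed to awk as "out";
        # the print is a stand-in for the real processing
        awk -v out="$out" '{ print $0 > out }' "$f" &
    done
    wait

    # mktemp names contain no whitespace, so the unquoted
    # expansion of $outs is safe here
    cat $outs > result.txt
    rm -r "$workdir"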

        -Otto
