Date:        Thu, 15 Jul 2021 10:19:17 +0100
    From:        "Geoff Clare via austin-group-l at The Open Group" <austin-group-l@opengroup.org>
    Message-ID:  <20210715091917.GA13523@localhost>

Sorry, I've had other (more useful) things to do than deal with this...

  | You are looking at the wrong EXIT STATUS wording.  It is the wording
  | for exit status 0 that is different.
  |
  | cd:
  |
  |     0 The directory was successfully changed.
  |
  | pwd:
  |
  |     0 Successful completion.
  |
  | See the difference now?

Sure, there's a difference, but it all either means nothing, or leans
towards my interpretation of the standard.

For pwd, "0 Successful completion" has never been in doubt; the question
has always been "what is required of pwd for it to be considered successful?"
That's where we started.

You contend that it must successfully write to standard output, because,
well, just because nothing else seems sane, if I understand what you've
been saying all this time.

I say that pwd must write to standard output, but that nothing says that
that write must actually succeed - because there simply is no text that
requires that (if there were, someone would have quoted it by now).

You argue that that makes no sense, as, if no output appears, pwd hasn't
done anything useful - I say that that might be true, but that's how it
has always been, and the standard is supposed to document what the industry
practice is, not what you (or even I) might want it to be.   You believe
that this was intentionally changed, but have provided no evidence at all
to support that contention (and nothing to justify doing such a thing, if
it had been done).
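To make the disagreement concrete: Linux provides /dev/full, a device that fails every write with ENOSPC. The wrapper below is mine (it is not anything from the standard, and /dev/full is Linux-specific); it merely encodes the reading I'm defending, where the write is attempted but its failure plays no part in "successful completion":

```shell
#!/bin/sh
# Sketch only: pwd_lenient is a hypothetical wrapper encoding the
# reading that the write must be attempted, not that it must succeed.
pwd_lenient() {
    pwd             # the required attempt to write the directory name
    return 0        # completion is judged without regard to that write
}

# /dev/full (Linux) fails every write with ENOSPC:
pwd_lenient > /dev/full
echo "exit status with unwritable stdout: $?"
```

Whether any real pwd behaves this way is exactly the implementation-practice question; the sketch only shows that both behaviours are expressible, so the wording has to pick one.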

This all gets even more obvious when we consider the exit status for cd...

82503 0 The current working directory was successfully changed and the value of the PWD
82504   environment variable was set correctly.

82508 >0 Either the -e option or the -P option is not in effect, and an error occurred.

(Let's just forget the (new) -e option for now, nothing material changes
in that case.)

Here the requirement is that the exit status is 0 if "The current working
directory was successfully changed" (and PWD was set, but that's not an
issue here), and ">0 if an error occurred".   Since the exit status cannot
be both 0 and >0, by definition (there is only one exit status), if the
directory was successfully changed (and PWD updated) then it is impossible
for an error to have occurred.

That is, even if:

82490 STDOUT
82491 If a non-empty directory name from CDPATH is used, or if the operand '-' is used, an absolute
82492 pathname of the new working directory shall be written to the standard output as follows:

fails.   That is, a write error writing to standard output cannot be
treated as an error that occurred, otherwise cd would be required to
exit both 0 and >0.

Again, this conforms with the (ancient) industry practice of ignoring
write errors on standard output (whether you, or anyone else, believes
that is a good thing, or reckless foolishness).
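The STDOUT case in question is easy to trigger. A sketch (the directory names here are mine): with a non-empty CDPATH entry, cd is obliged to write the new directory's absolute pathname; discard that ancillary output entirely and the directory is changed all the same, which on this reading is all that exit 0 reports:

```shell
#!/bin/sh
# With a non-empty directory name from CDPATH, cd shall write the
# absolute pathname of the new working directory to standard output.
mkdir -p /tmp/cdpath_demo/sub
CDPATH=/tmp/cdpath_demo

cd sub > /dev/null      # the ancillary pathname output is discarded
echo "cd exit status: $?  now in: $(pwd)"
```

The redirection stands in for a failed write; either way, "the current working directory was successfully changed" remains true.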

If it is not an error for cd, then unless there is text somewhere to the
contrary, it is not an error that occurred for pwd either (the wording in
the STDOUT sections of the two commands is essentially the same).   No such
text has been quoted, so I assume no such text exists.

  | The text I have already referred to is perfectly sufficient.

No, it is not.

  | I am drawing
  | a conclusion from it that should be completely obvious to anyone with
  | even a rudimentary knowledge of computer terminology.

You're drawing a conclusion which seems like the only sane thing - but
the standard is not required to be sane if the implementations are not.
If the implementations don't check for these write errors, then it is
wrong of the standard to pretend that they do, and totally unjustifiable
to attempt to legislate to make that happen (that would put the standard
group into the position of being some kind of monopoly cartel).

  | > actually says "shall successfully write to" ...   It doesn't.
  |
  | The word "successful" is in the description of exit status 0.

It is, but they are different things being successful: one is the write
(which is not required to succeed), the other is completion of the command,
which can be successful (as insane as that looks) even when the write has failed.

  | > Really?   Given all this unexplained data loss, there must be a whole
  | > raft of bug reports, and/or fixes, over the years, I assume that you
  | > have evidence of that, or are you just guessing?
  |
  | Are you genuinely asking me for *evidence* that some particular thing
  | has caused an event to go unnoticed or unexplained?

I am.

Otherwise what you're doing is just spreading FUD.

What if I were to assert that there must be many unexplained deaths in
the UK each year from undetected cases of yellow fever?  No-one tests for
that any more, as no-one believes it ever occurs, hence, there simply must
be many undetected cases (by your logic).  It certainly could be happening.

Nonsense.

One cannot claim that something must be happening simply because one can
show that it might happen, unless one has some evidence that it really does
happen (in real world cases).   If you have the evidence, from examining a
sampling of cases where unexplained things happened, where you can produce
enough evidence to show that an undetected write error to standard output
caused a loss that wasn't otherwise noticed, then perhaps we can believe
that there are other similar cases - but you have to show that it really
happens (in real world cases, not imagined or test scenarios) first.

  | I didn't claim these events aren't rare - I said they had undoubtedly
  | happened many times. That's across dozens of utilities used by millions
  | of users over almost three decades.

And if they had, someone would have noticed, at least a few times, and
reported it.   Otherwise all you have is supposition, guess work, FUD.

  | It's easy to see how data loss caused by an ENOSPC error could go
  | unnoticed or unexplained if not diagnosed by a utility.  Here's just
  | one plausible scenario...

Actually not so plausible.

  |     You kick off a "find ... -exec grep -l ... {} +" command and go and
  |     make a coffee.

I wouldn't, but that's beside the point (I'm not addicted to caffeine, or
other drugs...)

  |     Another user has by mistake executed a runaway command that is
  |     filling the disk.

That may have been plausible once, but rarely is any more - not because
runaway commands no longer fill disks, though that's getting harder and
harder to achieve with the amounts of storage around these days, but because
this "another user" is very unlikely to exist any more - computers have
become so cheap (along with attached storage) that almost no-one runs real
commands like this in shared systems any more (long term or project storage
may be shared, but local computing, and working, happens on local systems).

But again, we can ignore that, as it still is possible.

  |     When it fails with an "out of space" error they
  |     realise their mistake and remove the huge file they created.

And that could happen, though any reasonable system is going to leave
evidence in the system error log that a filesystem was full, so it is
easy to see that it happened.

  |     During the short time the disk was full, one of your grep commands
  |     failed to write some output but did not report it as an error.

That might happen.

  |     When you return from your coffee break your find command has finished
  |     and there is no indication that anything went wrong.

But not that.   Sure, one of the grep commands might have "failed to write
some output", but the boundaries between the commands in something like
that, and the blocks in the filesystem, aren't synchronised.   What you're
almost certainly going to get is partial output from the grep that failed,
up to the end of the block that was part full before the filesystem filled
up, but not the rest of it - when the next grep (after the filesystem has
had space returned) runs and produces output, its output will be shoved
into what appears to be the middle of the previous output.

The resulting output file will appear corrupted, it won't simply be missing a
line.  It might be different if the filesystem was full before the command
started, so the very first output couldn't be written (that would be at the
start of a block) - but then the user running this is very likely to notice
that the filesystem is full, because they are active at the time (you can no
longer imagine all of this happening when they're not paying attention).

But regardless of all of this, you're still completely missing the point.

I am not arguing that it is a good thing that commands don't detect write
errors on standard output (except perhaps in cases like cd, and rm -v, and
another one or two I will mention below, and similar cases, where the
output is ancillary to the actual work being done) - what I am arguing is
that the standard, as it is written now, does not require such checks, and
that that is arguably the correct thing for the standard to say.

  | Your users may well be among those who have suffered data loss because
  | of this bug, but they don't know that this bug was the cause, so they
  | can't report it.

Users very often don't know the actual cause of bugs, but they report
them anyway.   When reported, they're investigated, someone works out
what might have happened, and if appropriate fixes things.   No reports at
all means nothing went wrong (some users might simply say "huh?" and
try again, others report every little thing that doesn't work exactly as
they think it should, right or wrong.)

And to conclude: unless something actually new appears here (most probably,
text currently in the standard that says something different from what I am
assuming, and which no-one has been able to find up to now ... but as I
said before, it is a BIG standard, and sometimes it takes time to happen
across the magic sentence), I am unlikely to continue this, as I believe
it to be clear that the standard does not require this.

As I was saying, to conclude, consider two more commands which are required
to write to standard output, and where exiting 0, even in the face of write
errors, is probably the right thing to do.

First, make:

98356 STDOUT
98357 The make utility shall write all commands to be executed to standard output unless the -s option
98358 was specified, the command is prefixed with an at-sign, or the special target .SILENT has either


Forget the "unless" stuff, we will just consider the cases where none of
that is true.

98921 When the -q option is not specified, the make utility shall exit with one of the following values:
98922 0 Successful completion.
98923 >0 An error occurred.

Again, we can just ignore -q (not use it); using it merely changes >0
to >1 (so it is irrelevant here).

In general, people want make to exit 0 if it successfully built the target
(or the target was already up to date) - while that is happening, make
typically writes lots to stdout, and those writes (or some of them) might
fail.   We still typically want exit(0) to mean "target exists and is up
to date" and no more than that.
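For what a demonstration is worth (the makefile and paths here are mine, and this assumes a make in PATH): run a trivial makefile twice with all of make's chatter discarded; the final 0 is reporting the state of the target, not the fate of the writes:

```shell
#!/bin/sh
# make's exit 0 conventionally means "target exists and is up to
# date"; its stdout chatter is ancillary to that.
demo=/tmp/make_demo
mkdir -p "$demo" && cd "$demo"
printf 'out: in\n\tcp in out\n' > Makefile
touch in

make > /dev/null        # builds out (echoed command line discarded)
make > /dev/null        # second run: target already up to date
echo "make exit status: $?"
```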


Second, consider jobs.  This one is a command whose primary purpose is to
write to standard output (unlike make), so it would seem to be a case where
exit(0) should (if anywhere) mean that those writes worked.

But:

94110 DESCRIPTION
94111 The jobs utility shall display the status of jobs that were started in the current shell environment;
94112 see Section 2.12 (on page 2348).

94151 STDOUT
94152 If the -p option is specified, the output shall consist of one line for each process ID:
94153 "%d\n", <process ID>

94154 Otherwise, if the -l option is not specified, the output shall be a series of lines of the form:
94155 "[%d] %c %s %s\n", <job-number>, <current>, <state>, <command>

94197 EXIT STATUS
94198 The following exit values shall be returned:
94199 0 Successful completion.
94200 >0 An error occurred.

All looks straightforward, right, except for this bit...

94113 When jobs reports the termination status of a job, the shell shall remove its process ID from the
94114 list of those ``known in the current shell execution environment''; see Section 2.9.3.1 (on page
94115 2336).

When that is considered, it gets much harder to work out how to deal with
write errors to standard output.   If "reports" (which means writes to
standard output) there means "successfully writes" then we have a problem,
as the shell (where jobs is built in, otherwise it has nothing to report
and this whole issue is moot) will write several lines to its output buffer
before doing a write() system call, and when that fails, it won't know what
has been reported, and what has not, so it will have no idea which jobs
should be removed from the process table.

On the other hand, if "reports" just means "attempts to write" then there is
no issue: we ignore the write error, in the unlikely case it happens, and
simply remove any job whose output we attempted to write.   Which I believe
is what shells actually do.

There's lots more like this - it really is no surprise that no-one has ever
wanted to deal with this can of worms, by actually requiring what you believe
should be required.

kre
