Re: RFC: uniform exit codes

Elias Pipping Wed, 07 Sep 2016 12:54:45 -0700

> On 2 Sep 2016, at 16:53, Elias Pipping <pipping.el...@icloud.com> wrote:
> 
> Dear list,
> 
> I’d like to talk about exit codes for a bit. The 
> uiop/run-program::%wait-process-result function e.g. currently waits for a 
> process to terminate and then returns something. An exit-code. What would you 
> expect that to be? What would you like it to be?
> 
> (I’ve already had a long discussion about this with Robert but Faré asked me 
> to take it to the mailing list, too).
> 
> An exit code should be a number and lie between 0 and 255, with only 0 
> signalling success, as far as I understand. A ‘return -1’ in C ends up as a 
> 255 once I check for it in lisp or shell. Beyond that there are customs but 
> not standards (there is sysexits.h but it’s not used all that much)
> 
> Please consider the three shell scripts that each contain just one line:
> 
> (1) exit 15
> (2) kill $$
> (3) sh -c 'kill $$'
> 
> If you saved them in separate scripts and ran them from within a shell, the 
> exit code would be 15/143/143. The take-away messages from that for me are 
> that
> 
> - the shell uses 128+n if the process dies in response to signal n
> - there are cases where the exit code is greater than 128 even though the 
> process itself did not die in response to a signal, thereby interfering with 
> this logic
> - the shell cannot distinguish (2) and (3)
> 
> From within lisp, often but not always (2) and (3) can be distinguished. 
> Sometimes, a process-wait function will return something like 15/(0 15)/(143 
> 0) for the above examples(*); sometimes a process-status function will report 
> (:exited 15)/(:signaled 15)/(:exited 143).
> 
> But some implementations will behave like the shell and always return 
> 15/143/143, e.g. ABCL, LispWorks <7, and Allegro CL with :wait t.
> 
> So the thing that we can reliably do is produce the sequence 15/143/143. 
> Please note that even this baseline is already a proposal for a change: With 
> today’s UIOP master branch, you could also get things such as 15/15/143, 
> 15/0/143, or 15/:sigterm/143, if I’m not mistaken.
> 
> What we could not reliably do is e.g. return things like 15/-15/143 or (15 
> :exited)/(15 :signaled)/(143 :exited). What I’ve implemented so far is a 
> compromise. Some platforms might return 15/(143 15)/143 and others just 
> 15/143/143. The (143 15) could easily be turned into (143 :signaled) instead, 
> that’s a matter of taste, the take-away message remains, though, that you 
> couldn’t be sure that what you think is case (2) isn’t really case (3). So 
> that leaves also the option of “let’s just not bother with distinguishing the 
> two”.
> 
> Looking forward to your feedback,
> 
> 
> Elias


Dear list,

I’ve now tested what happens on OpenBSD rather than Linux. Not entirely 
surprisingly (I simply hadn’t thought about it), it turns out that the 
difference between (2) and (3), namely the additional shell layer, also comes 
into play here:

  (defun %normalize-command (command)
    ...
    (etypecase command
      #+os-unix (string `("/bin/sh" "-c" ,command))
      #+os-unix (list command)
      …

In other words: whether %run-program is passed a string such as 
"/bin/something arg" or a list such as (list "/bin/something" "arg") 
potentially makes a difference, because a string is wrapped in /bin/sh -c 
while a list is run as is. And indeed it does make a difference for (2), which 
can be turned into (3) this way. Again, it’s not surprising that it does, but 
my impression is that the user really should not have to worry about this 
difference; otherwise the entire abstraction that %run-program provides breaks 
down.

So we’re now at a place where distinguishing (:exited 143) and (:signaled 15) 
may or may not work depending on whether you pass a string or a list, what 
operating system you're on, what lisp you’re on, and whether you call your 
script synchronously or asynchronously. I think it’s safe to say we should just 
give up on this undertaking. Return 143 in both cases. We can do that reliably 
now and I’m happy that we can. We should not return additional information if 
it’s not reliable.
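
For anyone who wants to reproduce the 15/143/143 baseline from a shell, here is 
a minimal sketch. The temporary directory and the file names 1.sh, 2.sh and 
3.sh are placeholders of mine (nothing UIOP uses), and I am assuming a POSIX sh 
plus mktemp:

```shell
#!/bin/sh
# Recreate the three one-line scripts (1), (2) and (3) in a scratch directory.
# The names 1.sh/2.sh/3.sh are hypothetical, chosen only for this sketch.
dir=$(mktemp -d)
printf '%s\n' 'exit 15'           > "$dir/1.sh"
printf '%s\n' 'kill $$'           > "$dir/2.sh"
printf '%s\n' "sh -c 'kill \$\$'" > "$dir/3.sh"

codes=
for n in 1 2 3; do
    status=0
    # Capture the status without tripping over `set -e`, should it be active.
    sh "$dir/$n.sh" || status=$?
    codes="$codes${codes:+/}$status"
    if [ "$status" -gt 128 ]; then
        # Ambiguous by design: this is either death by signal (status - 128)
        # or a process that simply called exit with that number.
        echo "script $n: $status (signal $((status - 128)), or a plain exit $status)"
    fi
done
rm -rf "$dir"

echo "$codes"   # prints 15/143/143
```

The shell reports the same 143 for (2) and (3), which is exactly the 
information loss described above.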


Elias

PS: The additional shell layer was something that confused me quite a bit, too, 
when I wrote wrappers around process-status and process signalling functions: 
If you run `sleep 1`, send it SIGSTOP, sleep for 2 seconds, and send it 
SIGCONT, it will run for approximately another second. If you do the same with 
`sh -c 'sleep 1'`, you’ll get a very different result: The shell will stop but 
`sleep 1` will continue to run. Once you send SIGCONT, the process will 
immediately return. All of this makes perfect sense but it becomes confusing if 
someone turns your `sleep 1` into `sh -c "sleep 1"` without telling you.
One situation where the number of layers of shell is less relevant (but still 
not completely so) is when a process is terminated or when checking if a 
process is still alive (that’s why I made the corresponding functions public 
and the others I mentioned earlier private).
Even here, killing a process will not necessarily kill its children. Windows 
has taskkill /t for that, I believe (a so-called “tree kill”), but I don’t 
think such a thing is possible on unix without additional requirements like 
cgroups on linux (at least I think that’s something systemd uses, and requires, 
them for).
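
To illustrate that last point, here is a rough sketch in plain sh of a child 
that outlives the shell wrapped around it. The pid file is a device of mine to 
learn the child's PID (it is not something %run-program does), and the 30 
second sleep is arbitrary:

```shell
#!/bin/sh
# A wrapper shell starts a long-running child and records the child's PID in a
# temporary file (hypothetical helper, used only for this demonstration).
pidfile=$(mktemp)
sh -c 'sleep 30 & echo $! > "$1"; wait' sh "$pidfile" &
wrapper=$!

# Wait until the wrapper has written the child's PID.
while [ ! -s "$pidfile" ]; do sleep 1; done
child=$(cat "$pidfile")

# Terminate the wrapper only, then collect its status.
kill "$wrapper"
wrapper_status=0
wait "$wrapper" || wrapper_status=$?

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$child" 2>/dev/null; then child_alive=yes; else child_alive=no; fi

# Clean up the orphaned child and the scratch file.
kill "$child" 2>/dev/null
rm -f "$pidfile"

echo "wrapper status: $wrapper_status, child still alive: $child_alive"
```

Here the wrapper dies from SIGTERM (reported as 143) while the sleep it 
started is still running afterwards, so terminating the process you launched 
says nothing about its descendants.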
