Re: s6-rc transition failures

2017-06-16 Thread Van Bemten, Lionel (Nokia - BE/Antwerp)
Hello Laurent,

Thanks for your answers. Other opinions and experiences are welcome.

>  That is a fair point. Normally, you should adjust the s6-rc
> timeouts (both the global one and the service-specific one) to
> make sure s6-rc does *not* time out before the service is ready -
> but if there's an unexpected significant delay, the situation can
> happen.

Just to be clear: I am talking about a service going into an infinite
loop or deadlock. That is obviously a bad service, but I want to protect
my system against it.

>  What I can do is add an option to s6-rc to make it explicitly send
> a s6-svc -d to a service that times out before reaching readiness:
> ensure that a service is either ready in time, or definitely down.
> Would that help?

Yes, that would help. I suppose you also mean waiting for the
service to go down before returning?
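For illustration, "either ready in time, or definitely down" can be approximated by hand with s6-svc, outside s6-rc, and this version does wait for the service to be really down before returning. This is a minimal sketch, not s6-rc code: it assumes s6-svc's `-wU`/`-wD` wait options and its `-T` timeout (milliseconds, nonzero exit on timeout). `up_or_down` and the injectable `runner` are hypothetical names; the runner only exists so the logic can be exercised without a live supervision tree.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical runner type: takes an argv list, returns the exit code.
Runner = Callable[[Sequence[str]], int]


def run(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def up_or_down(servicedir: str, timeout_ms: int, runner: Runner = run) -> bool:
    """Bring a longrun up; if it is not up and ready in time, bring it down.

    Returns True if the service reached readiness, False if it was stopped.
    """
    # -u: want the service up; -wU: wait until it is up *and ready*;
    # -T: stop waiting after timeout_ms milliseconds (nonzero exit on timeout).
    if runner(["s6-svc", "-T", str(timeout_ms), "-wU", "-u", servicedir]) == 0:
        return True
    # Not ready in time: ask for "down" and wait until the service is really
    # down (process dead and finish script done), so its state is unambiguous.
    runner(["s6-svc", "-wD", "-d", servicedir])
    return False
```

The final `s6-svc -wD` deliberately has no timeout here. A service that ignores SIGTERM would therefore rely on its `timeout-kill` setting to ever reach the down state.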

>  The annoying thing is it can't be symmetrical: when a down
> transition times out, there's no way I'm going to start the service
> again. :) But generally, a down transition timing out signifies a
> badly written finish script, or badly calibrated timeouts, and
> it can be easily solved by running s6-rc -d change again.

I agree. I would add that if timeout-down > timeout-kill + timeout-finish
+ some margin, the down transition should generally never time out.
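The margin rule can be written as a small check. This is a sketch under stated assumptions: all values are in milliseconds, as in s6-rc's `timeout-down` and s6-supervise's `timeout-kill` and `timeout-finish` files, and 0 is assumed to mean "no limit" for each of them. The function name and the `margin_ms` parameter are hypothetical.

```python
def down_never_times_out(timeout_down_ms: int, timeout_kill_ms: int,
                         timeout_finish_ms: int, margin_ms: int) -> bool:
    """Return True if a down transition should not hit s6-rc's timeout.

    Assumption: 0 means "no limit" for every value.
    """
    if timeout_down_ms == 0:
        # s6-rc waits forever: the transition cannot time out.
        return True
    if timeout_kill_ms == 0 or timeout_finish_ms == 0:
        # No SIGKILL escalation, or an unbounded finish script: the time
        # needed to reach "really down" has no upper bound.
        return False
    # Worst case: SIGTERM ignored until SIGKILL, then the full finish timeout.
    worst_case_ms = timeout_kill_ms + timeout_finish_ms
    return timeout_down_ms > worst_case_ms + margin_ms
```

For example, with `timeout-kill` at 3000 and `timeout-finish` at 5000, a margin of 1000 requires `timeout-down` above 9000.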

>  What I can do is add a bit of signal handling to s6-rc, so that if
> it gets interrupted, say with a SIGINT or SIGTERM, it exits ASAP,
> while still ensuring consistency of the service states.

I was thinking exactly the same :). I even think this could be tailored
to system shutdown (I do not see another use case). E.g. for ongoing
longrun up transitions, s6-rc could act as if the transition timed out
and send "s6-svc -d". For ongoing longrun down transitions, I am not
sure whether it should wait for them to complete.

>  Unfortunately, for oneshots it would mean waiting for the current
> transitions to finish before exiting - s6-rc has no way to interrupt
> a running oneshot, and adding one (making s6rc-oneshot-runner kill
> all its children) would not help, because until the oneshot script
> exits, it is not visible from the outside whether it has accomplished
> its transition or not - so the state would still be undetermined.

I tend to think this would not be too much of a problem as I picture
oneshots as having timeout-up and timeout-down of a few seconds,
as opposed to longruns having timeouts of one or two minutes.
But this assumption may be totally wrong.
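The interrupt behaviour discussed in this exchange can be modelled roughly as follows. This is a hypothetical sketch, not s6-rc code. The signal handler only records the request, so work stops at a safe point instead of mid-transition. The decision table encodes the points made above: an in-progress longrun up transition is treated as timed out and sent "s6-svc -d", and oneshots are waited for because their outcome is unknown until the script exits. For in-progress longrun down transitions, which remain an open question, the sketch assumes they are left to finish. All names are made up for illustration.

```python
import signal
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Kind(Enum):
    LONGRUN = "longrun"
    ONESHOT = "oneshot"


@dataclass
class Pending:
    """A transition that was in progress when the interruption arrived."""
    name: str
    kind: Kind
    up: bool  # True for an up transition, False for a down transition


class StopFlag:
    """Signal handler that only records the request; no work is aborted."""

    def __init__(self) -> None:
        self.requested = False

    def __call__(self, signum, frame) -> None:
        self.requested = True


def install(flag: StopFlag) -> None:
    """Route SIGINT and SIGTERM to the flag (must run in the main thread)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, flag)


def interrupt_actions(pending: List[Pending]) -> List[Tuple[str, str]]:
    """Decide what to do with each in-progress transition on interruption."""
    actions = []
    for t in pending:
        if t.kind is Kind.ONESHOT:
            # A oneshot's result is unknown until its script exits.
            actions.append((t.name, "wait"))
        elif t.up:
            # Treat like a timeout: make sure the longrun ends up down.
            actions.append((t.name, "s6-svc -d"))
        else:
            # Assumption: let an ongoing down transition finish.
            actions.append((t.name, "wait"))
    return actions
```

The main loop would check `flag.requested` between transitions and, once it is set, apply `interrupt_actions` to whatever is still pending.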

>  Also, state consistency cannot be 100% ensured, because s6-rc could
> still receive a SIGKILL - but if you kill -9 s6-rc, you deserve
> trouble.

I won't kill -9 s6-rc, I promise.

Kr,
Lionel

s6-rc transition failures

2017-06-15 Thread Van Bemten, Lionel (Nokia - BE/Antwerp)
Hello all,

I am facing questions about how to correctly handle
transition failures with s6-rc. The new permanent failure feature
already clarifies some scenarios, but I still have doubts about
some cases. Below are two concrete examples. I would
be happy to get remarks or suggestions about how to cope
with them cleanly :).

1. I start a longrun service with "s6-rc -u change svc". This
service hangs and never sends its readiness notification. After the
timeout, s6-rc declares the transition a failure. But the process
is actually running, and I have no way to stop it through s6-rc.
The only way is to issue "s6-svc -d /path/to/svc". But then I have
the feeling I am working behind s6-rc's back to unblock
the situation because s6-rc cannot handle it. Somehow I wish
s6-rc had an extra BUSY state between DOWN and UP while
the transition is ongoing, and that it could bring me back to
DOWN when I ask it to.
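The wished-for BUSY state can be sketched as a tiny state machine. It is purely hypothetical, since s6-rc has no such state. The point it illustrates is that a timeout leaves the service in BUSY, because the process is still there, instead of reporting a clean failure, and that "stop" is accepted from BUSY. As a simplification, down transitions are treated as instantaneous.

```python
from enum import Enum


class State(Enum):
    DOWN = "down"
    BUSY = "busy"  # transition ongoing: process may be running but not ready
    UP = "up"


# (current state, event) -> next state. Hypothetical model, not s6-rc.
TRANSITIONS = {
    (State.DOWN, "start"): State.BUSY,
    (State.BUSY, "ready"): State.UP,
    (State.BUSY, "timeout"): State.BUSY,  # the process is still there
    (State.BUSY, "stop"): State.DOWN,     # "bring me back to DOWN"
    (State.UP, "stop"): State.DOWN,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise ValueError for an impossible event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} not allowed in state {state.name}") from None
```

With this model, scenario 1 reads `DOWN -start-> BUSY -timeout-> BUSY -stop-> DOWN`, with no need to go around the state manager.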

2. Slightly related, I have an issue with system shutdown. I am
working on a buildroot system, and specifically I use the
/etc/rc.tini which can be found here [1] and which is executed
as part of the system's shutdown sequence. The problem
is with the invocation of "s6-rc -b -da change" (I added the -b).
If an s6-rc invocation is already in progress, the shutdown sequence
is blocked until the first s6-rc times out. And this kind of timeout
is on the order of minutes, as I have slow services depending
on each other. I currently think the best thing to do is to
"killall s6-rc" before calling "s6-rc -ad change". This leaves a small
race condition, but more importantly, I have concerns
about killing an ongoing s6-rc. This will leave longrun services
in the middle of a state transition (this is the connection with
the first scenario), and I expect the final effect is that the
finish scripts will not be executed before the system goes down,
whereas running them is precisely what I want to happen when I call
"s6-rc -ad change". Secondly, I do not know what effect this will
have on oneshots. I fear "/etc/init.d/S98xxx start" will still be
running while "/etc/init.d/S98xxx stop" is executed, the thought
of which horrifies me beyond reasoning.

[1] https://github.com/elebihan/s6-br2-init-skeleton/blob/master/data/skeleton/etc/rc.tini

Thanks in advance for any idea that might help me :).

Kr,
Lionel

Re: [announce] skarnet.org Spring 2017 release: s6 init fails with buildroot on MIPS

2017-05-11 Thread Van Bemten, Lionel (Nokia - BE/Antwerp)
Hello all,

After upgrading to the skarnet packages release below, my system
no longer boots on MIPS. I am using buildroot and the
s6-linux-init-skeleton provided there by Éric Le Bihan.

Init usually stays blocked at this point:

  s6-rc: info: processing service fdholder: starting

with the following processes running:

  136 root s6-rc -v 3 -u change services-all
  143 root s6-fdholderd -1 -i data/rules
  146 root s6-ipcserverd -1 -- s6-ipcserver-access -v0 -E -l0 -i data/rules -- s6-sudod -t 2000 -- /usr/libexec/s6-rc-oneshot-run -l
  173 root s6-svlisten1 -u -- /run/s6-rc/scandir/klogd-log s6-svc -u -- /run/s6-rc/scandir/klogd-log
  183 root s6-ftrigrd
  296 root s6-svlisten1 -U -- /run/s6-rc/scandir/fdholder s6-svc -u -- /run/s6-rc/scandir/fdholder
  297 root s6-ftrigrd

and I regularly, though not consistently, see the following message:

  s6-ftrigrd: fatal: unable to flush asyncout: Broken pipe

When I build the very same system for x86 I do not see any issue; it is
very stable, so I suspect a cross-compilation issue. Since buildroot
patches skalibs in order to determine system capabilities through
build-time asserts instead of runtime tests, it is possible that
something goes wrong there. However, I have adapted the patches to the
new release, and so far I cannot identify any error in the detected
system settings.

If anyone has any idea to help the investigation, it would be highly
appreciated :). For example, maybe a hint on what could cause
s6-ftrigrd to get a broken pipe?
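For background on the error itself: "Broken pipe" is EPIPE, which a process gets when it writes to a pipe or stream socket whose reading end has already been closed. The sketch below only reproduces that generic condition with an ordinary pipe. It says nothing about why s6-ftrigrd's peer went away in this case. Python ignores SIGPIPE by default, so the failed write surfaces as an exception instead of killing the process.

```python
import errno
import os


def write_after_reader_closed() -> bool:
    """Write to a pipe whose read end is closed; True if the write got EPIPE."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reader goes away, e.g. because it exited
    try:
        os.write(write_fd, b"event\n")
    except OSError as e:
        return e.errno == errno.EPIPE
    finally:
        os.close(write_fd)
    return False  # the write unexpectedly succeeded
```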

Kr,
Lionel

From: supervision@list.skarnet.org on behalf of Jean Louis
Sent: Tuesday, March 28, 2017 3:45 PM
To: Laurent Bercot
Cc: skaw...@list.skarnet.org; supervision@list.skarnet.org
Subject: Re: [announce] skarnet.org Spring 2017 release

By the way, I upgraded without any particular checks, and I see it is
working well after a reboot.

On Tue, Mar 28, 2017 at 12:27:28PM +, Laurent Bercot wrote:
> >  * s6-2.5.0.0
> > --
>  And obviously I forgot to mention the important change for users:
> s6-svstat can now print programmatically parsable output, via a new
> "-o field" option (and shortcuts for common fields). This feature was
> asked for a long time ago.
>
> --
>  Laurent
>
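Here is a hedged sketch of consuming the new parsable output from a script. It assumes that `s6-svstat -o up,pid servicedir` prints the requested fields on one line, space-separated and in the requested order. Check the s6-svstat documentation for the exact field names and value formats of your version. The function name and the injectable `runner` are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Sequence

# Hypothetical runner type: takes an argv list, returns the command's stdout.
TextRunner = Callable[[Sequence[str]], str]


def capture(argv: Sequence[str]) -> str:
    """Default runner: run the command and return its standard output."""
    return subprocess.run(list(argv), check=True, capture_output=True,
                          text=True).stdout


def svstat(servicedir: str, fields: Sequence[str],
           runner: TextRunner = capture) -> Dict[str, str]:
    """Query selected s6-svstat fields and map each field name to its value."""
    out = runner(["s6-svstat", "-o", ",".join(fields), servicedir])
    values = out.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {out!r}")
    return dict(zip(fields, values))
```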


Re: Customise shutdown signal at the s6-rc level?

2017-05-02 Thread Van Bemten, Lionel (Nokia - BE/Antwerp)
> Doesn't
>
>svc -wD -T1000 servicedir || svc -k servicedir
>
> do what you want for the "hangup problem"?

The problem is that you can't trigger that from s6-rc.

Lionel

Re: Customise shutdown signal at the s6-rc level?

2017-05-02 Thread Van Bemten, Lionel (Nokia - BE/Antwerp)
What I would like to have is "s6-rc -d change foo" sending SIGTERM and then
SIGKILL if the service is not down after x seconds. Currently, if a daemon
hangs, it has annoying side effects. If I don't put a timeout on the s6-rc
command, the state machine is blocked, and I cannot shut down the system
anymore. If I do put a timeout, s6-rc considers the service down while it
is not. It is then impossible to stop or restart it. When the timeout
occurs, maybe s6-rc should send SIGKILL to make sure the service is down?
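The behaviour asked for here, SIGTERM followed by SIGKILL after a grace period, is the usual escalation pattern. The sketch below shows it on a plain child process using Python's subprocess module. It is not s6-rc code, and `stop_with_escalation` and `grace_s` are hypothetical names. The quoted `svc -wD -T1000 servicedir || svc -k servicedir` line expresses the same idea with the supervisor's own tools.

```python
import subprocess


def stop_with_escalation(proc: subprocess.Popen, grace_s: float) -> str:
    """Send SIGTERM; if the process survives grace_s seconds, send SIGKILL.

    Returns "terminated" or "killed" depending on which signal was needed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        proc.wait()  # reap the child so no zombie is left behind
        return "killed"
```

In this pattern the caller always gets a definite "down" result, which is the property missing when s6-rc simply gives up after its timeout.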

Lionel

From: supervision@list.skarnet.org on behalf of Laurent Bercot
Sent: Tuesday, May 2, 2017 10:51:19 AM
To: supervision@list.skarnet.org
Subject: Re: Customise shutdown signal at the s6-rc level?

>[1] 
  Damn ezmlm-cgi bug, this time it triggered without an accented
character. Sorry about that :(
  Here's the URL to the full message:
  https://www.mail-archive.com/supervision@list.skarnet.org/msg01427.html

  About customizing shutdown signals in s6-rc: how do you suggest it
should be done? There are two ways I can see it work:
  1. by including a call to the "trap" binary in the generated run
script.
  2. by including the "what signal should I send" information into the
generated service database and making "s6-rc -d change foo" use that
information.

  Solution 1 means adding magic that changes the process tree, and I'm
very reluctant to do that. s6-rc-compile already performs magic with
run scripts (to move around fds for pipelining and notification), but it
doesn't change the final process tree: the long-lived process *is* the
user's run script. Adding a call to trap would break that expectation,
and I don't think it's worth it: users who need it can perform the
trapping in their run script themselves. (It's trivial to do in shell.)

  Solution 2 means changing the s6-rc database ABI by adding a field
(which is doable at the cost of a major version bump and recompilation
of every user database), but more importantly, it means that s6-svc -d
is not automatically valid anymore, and more investigation (looking into
the s6-rc database to find the correct signal...) is necessary for a
user to know the right way to temporarily stop a service. It's an
additional know-how burden I don't want to put on users, most of whom
are already having a hard enough time with s6 as is.

  Neither of those solutions is appealing to me. Currently, if someone
needs to use a different shutdown signal, I would recommend them to
manually perform a trap in their run script, be it with s6 or with
s6-rc.

  If I were to work on a more official, better integrated solution, I
would do it at the s6-supervise level. I would not implement custom control
scripts, for the reasons indicated in the above link, but it would
probably be possible to implement a safer solution, such as reading a
file containing the name of the signal to send when s6-svc -d is called.
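A file-based approach like this could be read roughly as follows. It is a hypothetical sketch of the idea, not s6-supervise code. The file name `down-signal` is used here for illustration only. Accepting either a name or a number, and falling back to SIGTERM when the file is missing or invalid, are assumptions chosen to keep "s6-svc -d" safe by default.

```python
import signal
from pathlib import Path


def down_signal(servicedir: str, filename: str = "down-signal",
                default: signal.Signals = signal.SIGTERM) -> signal.Signals:
    """Read the signal to use for "down" from a file in the service directory.

    Accepts a name ("SIGINT" or "INT") or a number ("2"); falls back to the
    default when the file is missing, empty or invalid.
    """
    try:
        text = (Path(servicedir) / filename).read_text().strip()
    except FileNotFoundError:
        return default
    try:
        if text.isdigit():
            return signal.Signals(int(text))
        name = text.upper()
        return signal.Signals[name if name.startswith("SIG") else "SIG" + name]
    except (KeyError, ValueError):
        # Unknown name or number: keep the standard behaviour.
        return default
```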

  Is there a real, important demand for this? I'd rather not do it, and
fix daemons that don't use SIGTERM to shut down instead...

--
  Laurent