On 16/07/2015 19:22, Colin Booth wrote:
You're right, ./run is up, and being in ./finish doesn't count as up.
At work we use a lot of runit and have a lot more services that do
cleanup in their ./finish scripts, so I'm more used to the runit
handling of down statuses (up for ./run, finish for ./finish, and down
for not running). My personal setup, which is pretty much all on s6
(though migrated from runit), only has informational logging in the
./finish scripts, so it's rare for my services to stay in that
interim state long enough for anything to notice.

 I did some analysis back in the day, and my conclusion was that
admins really wanted to know whether their service was up as opposed
to... not up; and the finish script is clearly "not up". I did not
foresee a situation like a service manager, where you would need to
wait for a "really down" event.


As for notification, maybe 'd' for when ./run dies, and 'D' for when
./finish ends. Though since s6-supervise SIGKILLs long-running
./finish scripts, it encourages people to do their cleanup elsewhere,
and as such removes the main reason you'd want to be notified when
your service is really down. If the s6-supervise timer weren't there,
I'd definitely suggest sending some message when ./finish went away.
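
For illustration, a consumer of that hypothetical 'D' event might look
something like this (a sketch only: the -D option is an assumed name
mirroring the existing lowercase options, not something s6-svwait
provides today):

  # Wait until ./run has died (existing behavior):
  s6-svwait -d /service/myservice
  # Hypothetical: also wait until ./finish has ended, i.e. "really down":
  s6-svwait -D /service/myservice && echo "myservice is really down"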

 Yes, I've gotten some flak for the decision to put a hard time limit
on ./finish execution, and I'm not 100% convinced it's the right
decision - but I'm almost 100% convinced it's less wrong than just
allowing ./finish to block forever.

 ./finish is a destroyer, just like close() or free(). It is nigh
impossible to define sensible semantics that allow a destroyer to fail,
because if it does, then what do you do? void free() is the right
prototype; int close() is a historical mistake.
 Same with ./finish; nobody tests ./finish's exit code, and that's
okay, but since ./finish is a user-provided script, it has many more
failure modes than just exiting nonzero - in particular, it can hang
(or simply run for ages). The problem is that while it's alive, the
service is still down, and that's not what the admin wants.
Long-running ./finish scripts are almost always a mistake, and that's
why s6-supervise kills ./finish scripts so brutally.
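 For instance, a ./finish that stays within those constraints does
nothing more than record why ./run died and then exits immediately
(a minimal sketch; s6-supervise passes information about ./run's death
as arguments - check the documentation for the exact semantics):

  #!/bin/sh
  # ./finish - informational only; no cleanup, no waiting.
  # $1 and $2 describe how ./run died (exit code and signal; see the
  # s6 docs for the precise values).
  echo "run died: exit=$1 signal=$2" >&2
  # Anything long-running belongs elsewhere, not here, because
  # s6-supervise will kill this script if it lingers.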

 I think the only satisfactory answer would be to leave it to the user:
keep killing ./finish scripts on a short timer by default, but have
a configuration option to change the timer or remove it entirely. And
with such an option, a "burial notification" when ./finish ends becomes
a possibility.
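 As a sketch of what that could look like (the file name and units here
are assumptions, not an existing interface): a per-service file, read
by s6-supervise from the service directory, e.g.

  # Hypothetical per-service knob for the ./finish timer, in milliseconds:
  echo 60000 > /service/myservice/timeout-finish   # give ./finish 60 seconds
  echo 0 > /service/myservice/timeout-finish       # never kill ./finish

and the "burial notification" would then fire whenever ./finish
actually exits.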


Ah, gotcha. I was sending explicit timeout values in my s6-rc commands,
not using timeout-up and timeout-down files. Assuming -tN is the
global value, then passing that along definitely makes sense, if
nothing else to bring its behavior in line with the behavior of
timeout-up and timeout-down.

 Those pesky little s6-svlisten1 processes will get nerfed.
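 For reference, the two mechanisms being discussed are roughly the
following (a sketch, assuming the interface stays as currently
drafted; the exact syntax may change before release):

  # Global timeout for the whole transition, in milliseconds, given on
  # the command line:
  s6-rc -t 10000 -u change myservice
  # Per-service limits, as timeout-up and timeout-down files in the
  # service's source definition directory (the path here is only an
  # example), also in milliseconds:
  echo 5000 > /etc/s6-rc/source/myservice/timeout-up
  echo 5000 > /etc/s6-rc/source/myservice/timeout-down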


Part of my job entails dealing with development servers where
automatic deploys happen pretty frequently but service definitions
don't change too often. So having non-privileged access to a subsection
of the supervision tree is more important than having non-privileged
access to the pre- and post-compiled offline stuff.

 I understand. I guess I can make s6-rc-init and s6-rc 0755 while
keeping them in /sbin, where Joe User isn't supposed to find them.


By the way, that's less secure than running a full non-privileged
subtree.

 Oh, absolutely. It's just that a full setuidgid subtree isn't very
common - but for your use case, a full user service database makes
perfect sense.

--
 Laurent
