Re: Automatic retries of hooks

William Reade Wed, 20 Jan 2016 02:34:25 -0800

On Tue, Jan 19, 2016 at 3:14 PM, James Page <[email protected]> wrote:
>
> I think this is a dangerous behaviour to introduce to Juju; a hook error
> should be a signal to an end user that something really bad happened, and
> that they need to dig in further (preferably with points from status
> messages); if the function that a hook is performing is re-tryable, that
> needs to be handled in charm and not by Juju IMHO.
>


There are a few problems with this.

0) The function that a hook is performing *must* be retryable anyway. Hooks
need to be idempotent; we guarantee at-least-once execution, not
at-most-once.

1) As a user, what a hook error means in practice is "retry the hook" (good
thing all those hooks are idempotent...). Most users aren't in a position
to debug their charm if it goes wrong, so their only actual interaction is
basically a thoughtless Pavlovian response, the absence of which can leave
an environment needlessly hosed until a human notices it. May as well
automate it for better UX *and* happier outcomes.

2) In any given hook, the ratio of known errors to possible errors is
approximately 0:1 [0]. Those infinitesimally few known errors should indeed
set statuses before failing out (even if you have to look in status history
to see them); but we have to be mindful of the vast majority of cases,
where we have *no idea* what could have gone wrong. And in that case, the
only functional response is to retry -- some unknown errors may be fatal,
but to *assume* they all are risks locking up the system on every transient
blip.

3) Finally, now that you have the choice, I'd advise against in-hook
retries: (i) the longer you sit in one hook retrying, the longer all
colocated units are blocked [1]; and (ii) delegating the retries to the
infrastructure lets you write much, much cleaner code [2].
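
To make points 0 and 3 concrete, here is a minimal Python sketch of the
split I mean. It is not how Juju itself implements retries; the names
run_with_retries and FlakyConfigHook, the attempt count and the delay
values are all invented for illustration. The operation stays simple and
idempotent, and the retry policy lives in one reusable place outside it.

```python
import time
from typing import Callable


def run_with_retries(operation: Callable[[], None], attempts: int = 5,
                     initial_delay: float = 0.0) -> int:
    """Run an idempotent operation until it succeeds.

    Returns the number of attempts used. Re-raises the last error if
    every attempt fails. (Hypothetical helper, not a Juju API.)
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            operation()
            return attempt
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the error to a human
            time.sleep(delay)
            delay *= 2  # simple exponential backoff between attempts
    raise ValueError("attempts must be at least 1")


class FlakyConfigHook:
    """Hypothetical hook body: idempotent, but fails transiently at first."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.config = {}

    def __call__(self) -> None:
        # Setting a key to a fixed value is idempotent: running this
        # line once or five times leaves the same state behind.
        self.config["port"] = 8181
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient failure")
        self.config["ready"] = True


hook = FlakyConfigHook(failures_before_success=2)
attempts_used = run_with_retries(hook)
```

The hook contains no retry logic at all; it just does its work and fails
loudly if something goes wrong. Because it is idempotent, the partial work
left behind by a failed attempt is harmless when the next attempt runs.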

Are there any concerns that I've missed?

> Specifically I was testing some changes to the odl-controller charm; this
> feature covered up a race in the charm hook code accessing the API of ODL,
> which I failed to notice the first few times I deployed (not paying
> attention due to multi-tasking), and then had me scratching my head as to
> what was going on when I started to notice the hook failure.
>

You say "covered up a race", I say "automatically resolved the problem for
you" :-).

Cheers
William

[0] this applies to any code really, inside or outside juju, it's not
specific to hooks at all.
[1] and while it may not be *common* I'm pretty sure it'd be *possible* for
a hook to deadlock like this; would prefer not to encourage that.
[2] this is also widely applicable: adding retry logic *within* an
idempotent operation is basically always worse than building independent
operation-retrying infrastructure and reusing that where necessary.

-- 
Juju-dev mailing list
[email protected]
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev