Re: Current handling of failed upgrades is screwy

Menno Smits Tue, 15 Jul 2014 19:46:28 -0700

OK - points taken.

So taking your ideas and extending them a little, I'm thinking:


   - retry upgrade steps on failure (with a delay between attempts)
   - indicate when there are upgrade problems by setting the machine agent
   status
   - if the upgrade still won't complete despite the retries, report this in
   status and keep the agent running, but with the restricted API in place and
   most workers not started (i.e. as if the upgrade is still running). This
   allows "juju status" and "juju ssh" to work unless a significant upgrade
   step that hasn't run prevents them from working.

Does that sound reasonable?




On 15 July 2014 19:33, William Reade <william.re...@canonical.com> wrote:

> FWIW, we could set some error status on the affected agent (so users can
> see there's a problem) and make it return 0 (so that upstart doesn't keep
> hammering it); but as jam points out that's not helpful when it's a
> transient error. I'd suggest retrying a few times, with some delay between
> attempts, before we do so (although reporting the error, and making it
> clear that we'll retry automatically, is probably worthwhile).
>
> And, really, I'm not very keen on the prospect of continuing to run when
> we know upgrade steps have failed -- IMO this puts us in an essentially
> unknowable state, and I'd much rather fail hard and early than limp along
> pretending to work correctly. Manual recovery of a failed upgrade will
> surely be tedious whatever we do, but a failed upgrade won't affect the
> operation of properly-written charms -- it's a management failure, so you
> can't scale/relate/whatever, but the actual software deployed will keep
> running. However, I can easily imagine that continuing to run juju agents
> against truly broken state could lead to services actually being shut
> down/misconfigured, and I think that's much more harmful.
>
> Cheers
> William
>
>
> On Thu, Jul 10, 2014 at 9:57 AM, John Meinel <j...@arbash-meinel.com>
> wrote:
>
>> I think it fundamentally comes down to "is the reason the upgrade failed
>> transient or permanent?" If we can try again later, do so; otherwise log at
>> Error level and keep on with your life, because that is the only chance of
>> recovery (from what you've said, at least).
>>
>> John
>> =:->
>>
>>
>> On Thu, Jul 10, 2014 at 11:18 AM, Menno Smits <menno.sm...@canonical.com>
>> wrote:
>>
>>> So I've noticed that the way we currently handle failed upgrades in the
>>> machine agent doesn't make a lot of sense.
>>>
>>> Looking at cmd/jujud/machine.go:821, an error is created if
>>> PerformUpgrade() fails but nothing is ever done with it. It's not returned
>>> and it's not logged. This means that if upgrade steps fail, the agent
>>> continues running with the new software version, probably with partially
>>> applied upgrade steps, and there is no way to know.
>>>
>>> I have a unit-tested fix ready which causes the machine agent to exit
>>> (by returning the error as a fatalError) if PerformUpgrade fails, but before
>>> proposing it I realised that's not the right thing to do. The agent's upstart
>>> script will restart the agent and probably cause the upgrade to run and
>>> fail again, so we end up with an endless restart loop.
>>>
>>> The error could also be returned as a "non-fatal" (to the runner) error
>>> but that will just cause the upgrade-steps worker to continuously restart,
>>> attempting the upgrade and failing.
>>>
>>> Another approach could be to set the global agent-version back to the
>>> previous software version before killing the machine agent but other agents
>>> may have already upgraded and we can't currently roll them back in any
>>> reliable way.
>>>
>>> Our upgrade story will be improving in the coming weeks (I'm working on
>>> that). In the meantime, what should we do?
>>>
>>> Perhaps the safest thing to do is just log the error and keep the agent
>>> running the new version and hope for the best? There is a significant
>>> chance of problems but this is basically what we're doing now (except
>>> without logging that there's a problem).
>>>
>>> Does anyone have a better idea?
>>>
>>> - Menno
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Juju-dev mailing list
>>> Juju-dev@lists.ubuntu.com
>>> Modify settings or unsubscribe at:
>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>
>>>
>>
>>
>

