On 08/07/2012 06:37 PM, Matt Wagner wrote:
Hi all,
I had a few questions on
https://www.aeolusproject.org/redmine/issues/3623 and its implementation
task, #3624.
It all makes sense on the surface -- if a launch fails, or a user tries
to launch but doesn't have permission (which is really a subset of
"a launch fails"), we want to record a notification and display it.
But I find that, as I go to implement this, things become hazier.
For one, if you don't have permission to launch a deployable, the button
is disabled. We can have the controller log a notification if it refuses
permission anyway (such as if someone hits the URL directly), but it
doesn't feel like I'm tackling an urgent customer priority there. Am I
overlooking a case where a user would be able to go launch a deployable
but then have it fail with a permissions error after they pressed the
button?
No, a deployment can't fail because of the permissions check [1]. I
think that by "...or are not authorised" in the #3623 description, Angus
meant the not-yet-implemented declined approval (a user asks for launch
approval, which is then declined).
Second, I'm having a hard time even finding a way to intentionally cause
a launch failure. For the most part, if it's something not on the cloud
provider itself, we seem to detect it and disable the Launch button.
Is there any obvious use case I'm overlooking?
The launch process has two phases:
1) a transaction in which we create a deployment and do all possible
checks. If something goes wrong, a rollback is done, the deployment is
not created, and the user stays on the new deployment page with a
descriptive error message (see app/models/deployment.rb#launch! method)
2) sending of launch requests - this is done in the background by
delayed_job because it may take more time. If something goes wrong, the
deployment (and its instances) stay created and the create_failed state
is set. You are right that these are mostly errors returned by dc-api or
the config server.
(see app/models/deployment.rb#send_launch_requests)
These are the use cases Justin and Joe mentioned. Other examples:
- provider is not accessible
- provider account doesn't work (wrong credentials, disabled)
- dc-api is not running (well, we check this every 5 minutes)
- hw profile mismatch
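
To make the two phases concrete, here is a rough sketch in plain Ruby.
It is not the actual code in app/models/deployment.rb: the class names,
the quota check, and the provider callable are all assumptions made up
for illustration, and the transaction rollback is only imitated by
clearing an array.

```ruby
# Illustrative only: plain Ruby, not the actual Conductor models.
class LaunchCheckError < StandardError; end
class ProviderError < StandardError; end

SketchInstance = Struct.new(:name, :state)

class SketchDeployment
  attr_reader :instances, :errors

  def initialize(names, quota_left:)
    @names = names
    @quota_left = quota_left
    @instances = []
    @errors = []
  end

  # Phase 1: run every check up front; on failure nothing is kept.
  def launch!
    created = @names.map { |n| SketchInstance.new(n, "new") }
    raise LaunchCheckError, "quota exceeded" if created.size > @quota_left
    @instances = created
    true
  rescue LaunchCheckError => e
    @instances = []        # stands in for the transaction rollback
    @errors << e.message   # what the user would see on the form
    false
  end

  # Phase 2: runs later in the background; records stay in place and
  # only the state changes when a request fails.
  def send_launch_requests(provider)
    @instances.each do |inst|
      begin
        provider.call(inst)
        inst.state = "pending"
      rescue ProviderError
        inst.state = "create_failed"
      end
    end
  end
end
```

The point of the split is visible in the return values: a phase 1
failure leaves no instances behind and hands back an error message,
while a phase 2 failure leaves the records in place with only their
state changed.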
For failures that _are_ on the cloud provider, those will already show
up in our "Alerts" section, which lists failed instances.
Well, not exactly - a failed instance will show up there, but without
the reason it failed, because last_error is not set when:
- a launch request fails
- a deployment rollback is done and all instances which were not
launched are marked as create_failed.
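
One possible way to close that gap, sketched in plain Ruby. The struct,
the helper names, and the wording of the messages are hypothetical; the
real Instance model and rollback logic are more involved. The idea is
only that whatever sets create_failed also stores a reason.

```ruby
# Illustrative only: hypothetical names, not the real Instance model.
SketchFailedInstance = Struct.new(:name, :state, :last_error)

# Mark an instance as failed and keep the reason alongside the state.
def mark_create_failed(instance, reason)
  instance.state = "create_failed"
  instance.last_error = reason
  instance
end

# Try each launch in turn. After the first failure, the instances that
# were never launched are marked failed too, with a reason that points
# back at the instance that caused it.
def launch_all(instances, launcher)
  failed_on = nil
  instances.each do |inst|
    if failed_on
      mark_create_failed(inst, "not launched: rollback after #{failed_on} failed")
      next
    end
    begin
      launcher.call(inst)
      inst.state = "pending"
    rescue StandardError => e
      mark_create_failed(inst, e.message)
      failed_on = inst.name
    end
  end
end
```

With something like this, the 'Alerts' section would have a message to
show for both cases in the list above.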
Am I overlooking some conditions / use cases here? I'm not finding a
whole lot that needs to be done.
I think this task is mostly about improving the notification UI
(currently represented by the 'Alerts' section) and about what
information should be shown there. I had a call about this with Scott,
Angus and Jaromir. It would be best to contact Jaromir and discuss this
with him too (CCing him); he has some ideas about this.
-- Matt
Jan
[1] OT: I just realized that we will have to re-check permissions and
quota on the delayed_job side anyway, even though we don't support
delayed launch (launch requests are now sent immediately after they are
queued) - if delayed_job is down for a while, queued launch requests
will be executed with a delay after it's restarted. Anyway, this is not
part of #3623.
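
A minimal sketch of what that re-check could look like. delayed_job
runs job objects that respond to #perform; everything else here (the
class name, the callables, the result symbols) is assumed for
illustration and is not existing code.

```ruby
# Illustrative only: hypothetical job object, not existing Conductor code.
class SketchLaunchJob
  attr_reader :result

  def initialize(user_can_launch:, quota_left:, needed:)
    @user_can_launch = user_can_launch  # callable, evaluated at run time
    @quota_left = quota_left            # callable, evaluated at run time
    @needed = needed
  end

  def perform
    # Re-check here: the answers may have changed while the job was queued.
    return @result = :permission_denied unless @user_can_launch.call
    return @result = :quota_exceeded if @quota_left.call < @needed
    @result = :launch_requests_sent
  end
end

# The quota drops between queueing and execution, so the job refuses.
quota = 2
job = SketchLaunchJob.new(user_can_launch: -> { true },
                          quota_left: -> { quota },
                          needed: 2)
quota = 1
job.perform
```

Passing callables rather than values is the design choice that matters:
the checks are evaluated when the worker picks the job up, not when the
job was queued.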