Re: Please pick up my OpenStack task!

Jan Provaznik Tue, 25 Sep 2012 06:39:29 -0700

On 09/25/2012 02:53 AM, Matt Wagner wrote:

On Mon, Sep 24, 2012 at 04:55:47PM +0100, Angus Thomas wrote:

On 09/24/2012 04:04 PM, Jan Provaznik wrote:


Doing that thing where I reply to two emails at once here...

Hi Matt,
thanks for pushing this forward. After a brief look at this I have
some good news and bad news. Good news first:
It's quite easy to make Conductor work with OpenStack for
imported images. What is missing:


Thanks for picking this up, Jan! Glad to hear it works, at least in
part.

1) there is a bug in the current dc-core rpm (1.0.3-1.fc17), so getting
instance details doesn't work properly. This is already fixed in the
master branch, though, so it should be OK once 1.0.4 is released.


I would _love_ it if we were able to get >= 1.0.4 available / shipped
with Aeolus for future releases. I'm too oblivious to Deltacloud
packaging efforts to know whether that's feasible / already the
plan or not, but it would work well for us!

dc-core has an "at the beginning of each month" release cycle, so we are
lucky: the next release should be next week.

2) Once an OpenStack instance is stopped, it disappears from the server
immediately, so dbomatic gets a "not found" response when checking
state. IOW the instance doesn't stay in the 'stopped' state on the server.
For vanished instances dbomatic will have to check whether the last
instance action was a stop request and, if so, mark the instance as
stopped (though this solution is far from ideal).


Haven't we had this same problem with either RHEV or vSphere?

The problem is that we assume a provider will list a stopped instance
for a while after we stop it, because many of the prominent clouds do
this. But there's really no actual requirement that providers do this.
In fact, while convenient for us, it almost feels a bit weird to include
them in the list. Maybe marking them as 'vanished' as if they are in an
error state is not the right thing to do anymore.

It depends on how Conductor is commonly used by customers - if Conductor is
expected to be the "exclusive" or at least primary interface for managing
instances, and instance changes are done through Conductor, then usage of
the 'vanished' state doesn't seem to be wrong. Conductor knows about all
actions on an instance, so we should be able to distinguish between an
expected (after a stop/destroy action) and an unexpected 'vanished' state.
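
To make that distinction concrete, here is a rough Python sketch. It is not actual Conductor or dbomatic code, and every name in it (LastAction, classify_disappearance) is hypothetical; it only assumes that the last action requested through Conductor is recorded somewhere.

```python
from enum import Enum
from typing import Optional


class LastAction(Enum):
    """Hypothetical record of the last action requested through Conductor."""
    START = "start"
    STOP = "stop"
    DESTROY = "destroy"


def classify_disappearance(last_action: Optional[LastAction]) -> str:
    """Classify an instance that the provider no longer lists.

    Returns "expected" when we asked for a stop/destroy ourselves,
    and "unexpected" in every other case (including no recorded action).
    """
    if last_action in (LastAction.STOP, LastAction.DESTROY):
        return "expected"
    return "unexpected"
```

This only holds as long as Conductor really sees every action on the instance.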

But if it turns out that instances are commonly managed in other
ways (other than Conductor), then we should probably come up with something
better.

Currently only EC2 non-EBS instances disappear a while after a 'stop'
action. I agree that relying on the fact that these instances are
listed for a while after we stop them is not good, and we have to improve this.

For OpenStack, it turned out that the behaviour I described before
(the instance disappears from the server immediately after a 'stop' request)
is caused by 2 things:

1) dc-core handles all OpenStack instances as stateless - a stop action
means destroying the instance on the OpenStack server -> this is why it
disappears immediately
2) there was a bug in dc-core and it didn't update instance state after a
successful stop operation.

Once these 2 bugs are fixed/pushed, instance state will be set to
'stopped' in Conductor right after the stop action:

http://tracker.deltacloud.org/patch/957a640e
https://issues.apache.org/jira/browse/DTACLOUD-328

Bad news:
OpenStack is not supported in imagefactory-1.1; it has been supported
since imagefactory-1.2. And there were significant changes in the API
between these versions. We would have to make many changes in the current
Conductor code to make it 1.2 compatible. So build+push support is
blocked by IME integration into Conductor - IME is "1.2 ready" so
once it's integrated this should work too.


Ick. At least we have a lot to look forward to -- IME should be exciting
in and of itself for architectural reasons. Happening to bring support
for Openstack building, in an indirect way, is icing on the cake.

It seems to me that having working support for openstack, with the
ability to launch multi-instance deployments etc., with the caveat
that you have to start with imported images, rather than building &
pushing images, is still a pretty good place to be, for now at least.

Jan, how much can we improve the issue with instance state reporting
whilst IME integration is coming along?

Right now I hope that no fixes on the Conductor side are needed once the
above 2 bugs are fixed in dc-core. Though I didn't get to testing
multi-assembly deployments and launch-time params.

I'm trying to remember why we even have the 'vanished' state. It's only
triggered when one cycle we know an instance's status, and the next time
we poll for it there is no trace that it ever existed. We regard that as
an error condition and mark it as 'vanished.'

Well, we need some way to deal with instances which disappeared from the
provider for any reason (were deleted, permissions have changed).

The problem is that, as I discussed before, I don't think it's always
safe or reasonable to assume that instances will spend several polling
cycles in a stopped state. They do on some providers, but not all. I


Agree with you, we shouldn't rely on this.

also vaguely recall that it may have been added in response to a rare
bug on the RHEV PowerShell API or something like that, where it would
occasionally report only a partial list of instances. (Is this fact, or
am I mis-remembering?)

Maybe, I'm not sure. IIRC there was (actually still is) a bug in RHEV or
vSphere where the provider sometimes returns a "not found" error for an
instance get request, but the next get request for this instance passes.
IOW we shouldn't set the 'vanished' state after the first failure, but this
is probably not what you mean.
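
Something like the following Python sketch is what I have in mind for "not after the first failure". It is only an illustration, not existing dbomatic code: the VanishTracker class, its method and the threshold of 3 polls are all assumptions made up for this example.

```python
from collections import defaultdict


class VanishTracker:
    """Hypothetical helper: count consecutive "not found" polls per instance."""

    def __init__(self, threshold: int = 3):
        # Number of consecutive misses tolerated before reporting 'vanished'
        # (assumed value, would need tuning per provider).
        self.threshold = threshold
        self._misses = defaultdict(int)

    def record_poll(self, instance_id: str, found: bool) -> bool:
        """Record one poll result; return True once the instance should be
        treated as vanished."""
        if found:
            # A successful get resets the counter, so a single flaky
            # "not found" never changes the instance state.
            self._misses.pop(instance_id, None)
            return False
        self._misses[instance_id] += 1
        return self._misses[instance_id] >= self.threshold
```

With threshold=2, a miss followed by a successful get followed by another miss still reports nothing; only two misses in a row would.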

So maybe we shouldn't assume that Running->No Info means "vanished," and
just accept that it's normal for some providers. (This assumes, possibly
incorrectly, that we don't mark things as vanished on network or
Deltacloud hiccups or whatnot.) But I'm not sure it's always safe to
assume that the instance is stopped because it disappeared from our list
-- though it's more likely that that's what happened than that the
provider's API has a bug that caused it to not be reported.

Yeah, I'm not sure we could automatically treat vanished instances as
stopped.


Not to try to pawn our tough problems on other people, but I wonder if
this is a question for the Deltacloud folks. This feels a lot like the
type of per-provider detail that they seek to spare us from, and for all
I know they already have a good solution here.

-- Matt

Jan
