Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

Juvonen, Tomi (Nokia - FI/Espoo) Thu, 29 Sep 2016 02:23:25 -0700

Hi,

Replies inline; I hope they explain some of it.


In short, the alternatives are:
1./a. Reset server state for all VMs on a host with a single API call, either 
with 1./a/1. a host-specific reset server state or 1./a/2. the host force down 
API.
1./b. No reset server state is done. The host force down API would trigger a 
new kind of notification for each tenant about its affected VMs.
2. No notification through the controller. The Inspector would form the 
notifications and send them to the Notifier, which makes it easier to tailor 
notifications and alarms as we want. The Inspector calls only the host force 
down API, no reset server state.

Br,
Tomi

From: Yujun Zhang [mailto:zhangyujun+...@gmail.com]
Sent: Thursday, September 29, 2016 9:37 AM
To: Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvo...@nokia.com>; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

Hi, Tomi

Thanks for the summary.

I am a bit confused about the difference between 2. and 1./b. Would you 
please give an example to explain how it would work?
In 2. the Inspector sends the notification, not the controller. That means the 
notification can be tailored exactly to meet the needs. This is not the case 
with 1./b. Alternative 2. assumes we can tailor the notification and alarm(s) 
the way we want them to be.

Suppose we have

- tenant-a
  - vm-a on host-a
- tenant-b
  - vm-b on host-a

When a raw failure occurs on host-a, the existing sequence[1] would be

1. Monitor sends "host-a failure" event to Inspector
2. Inspector finds affected VMs (gets all VMs on host-a)
The Inspector should already know the VMs on the host; if properly 
implemented, it should not need to fetch them at this point. Anyhow, you are 
right, this is what we have currently.
3. Inspector resets affected VMs (vm-a and vm-b) to error state
Currently the Inspector resets servers to error state to get a notification 
from the controller (Nova). “2.” and “1./b.” state that the Inspector should 
not reset servers to error state, only force down the host.
4. Controller requests Notifier to notify all
In 2. the Controller is not the one making the notification that triggers the 
alarm.
...

I think this is how "1./a." works.
yes
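
To make the cost of the existing sequence concrete, here is a minimal Python 
sketch. The FakeCompute class and its methods (servers_on_host, reset_state) 
are hypothetical stand-ins for the controller, not real Nova client calls. The 
host and VM names come from the example above; vm-c on host-b is added only to 
show that unaffected VMs are left alone.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCompute:
    """Hypothetical in-memory stand-in for the controller (not the Nova API)."""
    placement: dict                              # VM name -> host name
    state: dict = field(default_factory=dict)    # VM name -> state set by reset
    calls: int = 0                               # number of API calls made

    def servers_on_host(self, host):
        self.calls += 1
        return sorted(vm for vm, h in self.placement.items() if h == host)

    def reset_state(self, vm, new_state):
        self.calls += 1
        self.state[vm] = new_state


def handle_host_failure_current(compute, host):
    """Existing sequence: look up the VMs on the host, then reset each one."""
    affected = compute.servers_on_host(host)
    for vm in affected:
        compute.reset_state(vm, "error")
    return affected


compute = FakeCompute(
    placement={"vm-a": "host-a", "vm-b": "host-a", "vm-c": "host-b"}
)
affected = handle_host_failure_current(compute, "host-a")
```

In this sketch the number of calls is one lookup plus one reset per affected 
VM, so it grows with the number of VMs on the failed host.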

For "1./b." it seems to be close to the alternative sequence in fault 
management scenario[2]. Instead of waiting for Controller to send notification, 
the Inspector will directly inform the Notifier about it.
1./b. sends a notification from the controller for all the VMs when the force 
down host API is called on Nova. In 2. the notification is sent directly from 
the Inspector to the Notifier.
Apparently, 5a is mandatory before 5b and 5c. But 5b and 5c. (alt) can
be triggered simultaneously with async calls.

If we deploy vitrage as the inspector, the error state of the VMs could be 
deduced and notified independently from the "5b. Update State" action. Then the 
time required for updating the state of all VMs would not matter any more.
The “Get valid server state” work gave the VM a host_status field so that one 
gets a proper state when the host is down. It was done precisely because, when 
a host is down, querying servers (through the Nova servers API) gave no 
indication that there was a problem. That in turn was because reset server 
state was not meant to be called, and no existing VM state field was to be 
changed to indicate that the host is down. Anyhow, if we in Doctor still insist 
that reset server state should be called, it is great that it can be done 
independently, as you say.
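
As a rough illustration of how a tenant could read the two fields together, 
here is a small Python sketch. It works on a plain dict shaped like a server 
record. The field names status and host_status and the values ACTIVE and DOWN 
are assumptions about the Nova servers API and should be checked against the 
API reference; the function describe_server is made up for this example.

```python
def describe_server(server):
    """Summarise a server record using both the VM state and the host state."""
    vm_state = server.get("status", "UNKNOWN")
    host_state = server.get("host_status", "")  # assumed empty when not exposed
    if host_state == "DOWN":
        return f"VM reports {vm_state}, but its host is down"
    if not host_state:
        return f"VM reports {vm_state}; host state not visible"
    return f"VM reports {vm_state}; host is {host_state}"


# Hypothetical record: the VM state is untouched, only the host is down.
record = {"name": "vm-a", "status": "ACTIVE", "host_status": "DOWN"}
summary = describe_server(record)
```

The point of the sketch is that the VM state itself is not changed; the host 
fault is visible only through the separate host field.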

"1./b." looks good to me but I'd like to hear more on "2."
1. Monitor sends "host-a failure" event to Inspector
2. Inspector knows the topology already, so it internally figures out the VMs 
on the host for the different tenants
3. Inspector forces down the host (and fences the host, for which we currently 
have nothing implemented)
4. Inspector sends the needed notifications to the Notifier to form exactly the 
alarms needed by tenants (and probably also a better alarm for the physical 
fault than could be made from the notification about the nova-compute service 
state change through the service.update notification)
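
The four steps above can be sketched in Python as follows. This is only an 
illustration under assumptions: the topology list, the handle_host_failure 
function, the force_down and notify callables, and the notification type names 
("host.fault", "tenant.vms.affected") are all hypothetical and do not 
correspond to an existing Doctor, Nova or Notifier interface.

```python
from collections import defaultdict

# Topology the Inspector is assumed to hold already: (tenant, vm, host).
TOPOLOGY = [
    ("tenant-a", "vm-a", "host-a"),
    ("tenant-b", "vm-b", "host-a"),
]


def handle_host_failure(host, topology, force_down, notify):
    """Alternative 2: force down the host, then notify the Notifier directly."""
    # Step 2: work out the affected VMs per tenant from the known topology.
    by_tenant = defaultdict(list)
    for tenant, vm, vm_host in topology:
        if vm_host == host:
            by_tenant[tenant].append(vm)

    # Step 3: force down the host (fencing would also go here).
    force_down(host)

    # Step 4: one physical fault notification plus one per affected tenant.
    notify({"type": "host.fault", "host": host})
    for tenant in sorted(by_tenant):
        notify({
            "type": "tenant.vms.affected",
            "tenant": tenant,
            "host": host,
            "vms": by_tenant[tenant],
        })


# Record the calls in lists instead of talking to real services.
forced = []
sent = []
handle_host_failure("host-a", TOPOLOGY, forced.append, sent.append)
```

With the example topology this produces one force down call and three 
notifications: one for the physical fault and one each for tenant-a and 
tenant-b. No server state is reset.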

[1] http://artifacts.opnfv.org/doctor/docs/index.html#figure-p1
Yes, the figure shows the path through the controller, but whether this is a 
good idea has already been discussed earlier.
[2] http://artifacts.opnfv.org/doctor/docs/index.html#figure8

On Wed, Sep 28, 2016 at 1:27 PM Juvonen, Tomi (Nokia - FI/Espoo) 
<tomi.juvo...@nokia.com> wrote:
Hi,

As discussed yesterday in the Doctor meeting, there are several ways to 
approach the problem and many different aspects. If we want to propose a 
blueprint to OpenStack Nova, there is a window open for the next couple of 
weeks to get it into the next Ocata release (or Danube in OPNFV). I am not sure 
there is time to make that, but here is a summary:


1.      The way we use “reset server state” is not the way it is used in 
OpenStack. Force down host does not require resetting server state.
Do we want to state that we still want to use it anyway, because we want the 
notification in order to have an alarm?

a.      Yes:

1.      Do we want to enhance the functionality to reset servers state for all 
servers on a host?

2.      Do we want force down API to be able to optionally reset server state 
for all VMs on host?
Note! “Get valid server state” was done because there is no server-specific 
state change when there is a host-specific fault (as reset server state is not 
called). This is why a host_status field was added, so that a user querying his 
server knows that nothing is wrong with the VM itself but that it is currently 
down because the host is in that state.


b.      No:
We could try to get a change in so that calling force down host sends a 
notification about the affected VMs (as many notifications as there are tenants 
with VMs).


2.      Only the Inspector knows everything that is needed for the different 
alarms, and it is just overhead to push that information through, for example, 
Nova to get a notification that can be translated into an alarm. Even then we 
do not get the right content in the alarms. This means the only way to get 
things right is to send the notification from the Inspector to the Notifier so 
that we have the right kind of alarms: tenant-specific alarms with their VMs 
and a separate physical fault alarm (with respect to ETSI GS NFV-IFA 005).

IMHO the only right choice is “2.”. The next best would be “1. / b.”. The 
least feasible would be “1. / a.”.

Br,
Tomi

_______________________________________________
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss
