I had a lively discussion yesterday with the OpenStack Nova cores about reset
server state. At first it was about how to do that with one API call for all
VMs on a host (hypervisor), as discussed in DOCTOR-78. But then it came to the
question of why we actually want reset server state in the first place. It is
not something that needs to be done when forcing down a host. If we want a
notification about the affected VMs, and further an alarm, then that is another
thing. If we want that kind of notification, it is something we should write a
spec for. We should not reset the state to error for each VM on a host; we
should not be doing that in the first place if the error was not on the VM but
at host level. (Yes, before you ask, Nova can leave a working VM's state
unchanged when the host is down. You do not touch the VM state unless you want
to do something for the VM, or the VM was actually the one having the error.
And you do not want to do something for the VM itself in every scenario;
sometimes you are just happy that it comes up again on the same host when the
host comes back.)

Again I realize here what I said long ago, before we had anything: it will not
be possible to make alarms correctly by changing state in Nova and other
controllers and then triggering the alarm from the notifications about those
state changes. That will never give us what we want for the alarms, although we
surely need to correct the states anyway. Even where a state change triggers a
notification, it will not carry the information needed in the alarm, and surely
we should not call APIs in vain just to get an alarm (like reset server state).

We want tenant/VNFM-specific alarms that tell which of the tenant's VMs
(virtual resources) are affected by a fault, and the cause (and surely also
alarms about physical faults, which will not be consumed by the tenant/VNFM,
and the other fields required by the ETSI spec). The only way of getting this
right for each kind of fault that can appear is to form all the alarms (the
notifications that form an alarm) in the Inspector (Congress or Vitrage). It is
the only place that has all the information needed in the different scenarios,
can make this right, and has the minimum delay that is crucial in Telco fault
management. Also, if we are looking to have OPNFV used in production, where one
would need to be OPNFV compliant, we need to make things right. The way we make
alarms today is a great step and a good proof of concept (changing states and
having the alarm in under 1 second), but I strongly suggest that our next steps
go towards a conceptually correct way to achieve this and have correct alarms.
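
To make the idea a bit more concrete, here is a rough sketch of what forming
alarms in the Inspector could look like. It is not Congress or Vitrage code and
not any existing API: the VM and Alarm types, the "admin" audience and the
form_alarms function are all hypothetical names made up for illustration. The
only point is that one host-level fault gives one physical alarm plus one alarm
per tenant listing that tenant's affected VMs and the cause, without touching
any VM state in Nova.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical inventory record the Inspector is assumed to hold."""
    vm_id: str
    tenant_id: str
    host: str


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm: who should receive it, what is affected, and why."""
    audience: str        # tenant id, or "admin" for the physical fault
    resource_ids: tuple  # affected VM ids, or the host name
    cause: str


def form_alarms(host, cause, inventory):
    """Form all alarms for one host-level fault from the Inspector's own data."""
    # Group the VMs running on the faulty host by the tenant that owns them.
    affected = defaultdict(list)
    for vm in inventory:
        if vm.host == host:
            affected[vm.tenant_id].append(vm.vm_id)

    # One alarm about the physical fault, not meant for tenants.
    alarms = [Alarm("admin", (host,), cause)]

    # One alarm per tenant, listing only that tenant's own VMs.
    for tenant_id in sorted(affected):
        vm_ids = tuple(sorted(affected[tenant_id]))
        alarms.append(Alarm(tenant_id, vm_ids, cause))
    return alarms


inventory = [
    VM("vm-1", "tenant-a", "compute-1"),
    VM("vm-2", "tenant-b", "compute-1"),
    VM("vm-3", "tenant-a", "compute-1"),
    VM("vm-4", "tenant-a", "compute-2"),
]
alarms = form_alarms("compute-1", "host down", inventory)
```

Here tenant-a would get one alarm for vm-1 and vm-3, tenant-b one alarm for
vm-2, and vm-4 on compute-2 is not mentioned at all. In a real Inspector the
inventory and the cause would come from its own data sources, and the alarm
would carry the other fields the ETSI spec asks for.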


opnfv-tech-discuss mailing list
