Hi,

I modified the test further so that it does not do the reset server state, but instead
just sends the “reset server state error” notification for each instance
when the force-down API is called:
for instance in instances:
    notifications.send_update_with_states(
        context, instance, instance.vm_state, vm_states.ERROR,
        instance.task_state, None, service="compute", host=host,
        verify_states=False)

This had the same result as going through instance.save(), which also changes the DB,
so it didn't make things any better.

Br,
Tomi

From: opnfv-tech-discuss-boun...@lists.opnfv.org 
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Juvonen, Tomi 
(Nokia - FI/Espoo)
Sent: Tuesday, October 04, 2016 12:30 PM
To: Ryota Mibu <r-m...@cq.jp.nec.com>; Yujun Zhang <zhangyujun+...@gmail.com>; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Suspected SPAM - Re: [opnfv-tech-discuss] [Doctor] Reset Server State 
and alarms in general

Hi,


1.      Tried token_cache_time=300, but since that is also the default, there was no difference.



2.      Then I modified the force-down API so that it internally does reset server
state for all instances on the host (so it is the only API called from the inspector),
and there was no difference:
With 10 VMs where 5 VMs on failing host: 1000ms
With 10 VMs where 5 VMs on failing host: 1040ms
With 20 VMs where 10 VMs on failing host: 1540ms to 1780ms


3.      Then I added debug prints around this code in the modified force-down API,
which gets the servers and does reset state for each.
Ran the Doctor test case:
With 20 VMs where 10 VMs on failing host: 1540ms
In Nova code:
Getting instances:
instances = self.host_api.instance_get_all_by_host(context, host)
Took: 32ms
Looping over the 10 instances to reset the server state to error:
for instance in instances:
    instance.vm_state = vm_states.ERROR
    instance.task_state = None
    instance.save(admin_state_reset=True)
Took: 1250ms
From the log one can then also pick up the whole time the API took:
2016-10-04 09:05:46.075 5029 INFO nova.osapi_compute.wsgi.server 
[req-368d7fa5-dad6-4805-b9ed-535bf05fff06 b175813579a14b5d9eafe759a1d3e392 
1dedc52c8caa42b8aea83b913035f5d9 - - -] 192.0.2.6 "PUT 
/v2.1/1dedc52c8caa42b8aea83b913035f5d9/os-services/force-down HTTP/1.1" status: 
200 len: 354 time: 1.4085381
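
For reference, the debug timing above can be collected with roughly the following
instrumentation (a sketch only; the helper name _reset_instances_on_host and the
oslo_log debug calls are illustrative assumptions, not the exact code I used):

import time

from nova.compute import vm_states
from oslo_log import log as logging

LOG = logging.getLogger(__name__)

def _reset_instances_on_host(self, context, host):
    # Time the DB query that fetches all instances on the failed host.
    start = time.time()
    instances = self.host_api.instance_get_all_by_host(context, host)
    LOG.debug("instance_get_all_by_host took %.0f ms",
              (time.time() - start) * 1000)

    # Time the per-instance reset; every save() issues its own DB update,
    # which is where most of the latency comes from.
    start = time.time()
    for instance in instances:
        instance.vm_state = vm_states.ERROR
        instance.task_state = None
        instance.save(admin_state_reset=True)
    LOG.debug("resetting %d instances took %.0f ms",
              len(instances), (time.time() - start) * 1000)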

So the usage of reset server state is currently not feasible (and, as indicated
before, it shouldn't even be used).

Br,
Tomi

From: Ryota Mibu [mailto:r-m...@cq.jp.nec.com]
Sent: Saturday, October 01, 2016 8:54 AM
To: Yujun Zhang <zhangyujun+...@gmail.com>; Juvonen, Tomi (Nokia - FI/Espoo)
<tomi.juvo...@nokia.com>; opnfv-tech-discuss@lists.opnfv.org
Subject: RE: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

Hi,


That’s an interesting evaluation!

Yes, this should be an important issue.

I suspect that token validation in keystone may take a major part of the processing
time. If so, we should consider using keystone trusts, which can skip the validation,
or using token caches. Tomi, can you try the same evaluation with token caches
enabled in the client (with --os-cache) and in nova ([keystone_authtoken]
token_cache_time=300)?
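
For illustration, what I mean would look roughly like this (treat it as a sketch;
option defaults may differ between releases). In nova.conf, read by
keystonemiddleware when validating tokens:

[keystone_authtoken]
# Cache validated tokens locally for 5 minutes.
token_cache_time = 300

And on the client side, for example:

nova --os-cache reset-state <server-id>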

Or maybe we can check how many HTTP messages and DB queries happen per VM reset?


Thanks,
Ryota

From: opnfv-tech-discuss-boun...@lists.opnfv.org
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Yujun Zhang
Sent: Friday, September 30, 2016 5:53 PM
To: Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvo...@nokia.com>;
opnfv-tech-discuss@lists.opnfv.org
Subject: Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

It is almost linear in the number of VMs, since the requests are sent one by
one. I think we should raise the priority of this issue.

But I wonder how it would perform if the requests were sent simultaneously with
async calls. How will nova deal with that?
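
For example, something along these lines on the inspector side could fire the
resets in parallel (just a sketch with placeholder credentials and host name,
assuming python-novaclient and a thread pool; the real Doctor inspector code is
different):

from concurrent.futures import ThreadPoolExecutor

from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client

# Placeholder credentials; a real inspector would reuse its own session.
auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_id='default', project_domain_id='default')
nova = client.Client('2.1', session=session.Session(auth=auth))

failed_host = 'compute-1'  # placeholder host name

# All servers on the failed host (admin-only query).
servers = nova.servers.list(search_opts={'host': failed_host,
                                         'all_tenants': 1})

# Issue the reset-state calls in parallel instead of one by one.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(nova.servers.reset_state, s, 'error')
               for s in servers]
    for f in futures:
        f.result()  # surface any per-server failure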
On Fri, Sep 30, 2016 at 4:25 PM Juvonen, Tomi (Nokia - FI/Espoo)
<tomi.juvo...@nokia.com> wrote:
Hi,
I ran the Doctor test case in the Nokia POD with the APEX installer and
state-of-the-art Airframe HW. I modified the Doctor test case so that I can run
several VMs and the consumer can receive alarms for them. I am measuring whether
it is possible to stay within the Doctor requirement of under 1 second from
recognizing the fault to the consumer having the alarm. This way I can see how
much overhead comes when more VMs are on the failing host (the overhead comes
from calling the reset server state API for each VM on the failing host).

Here is how many milliseconds it took to get the scenario through:
With 1 VM on failing host: 180ms
With 10 VMs where 5 VMs on failing host: 800ms to 1040ms
With 20 VMs where 12 VMs on failing host: 2410ms
With 20 VMs where 13 VMs on failing host: 2010ms
With 20 VMs where 11 VMs on failing host: 2380ms
With 50 VMs where 27 VMs on failing host: 5060ms
With 100 VMs where 49 VMs on failing host: 8180ms

Conclusion: In an ideal environment one can run 5 VMs on a host and still
fulfill the Doctor requirement. So this needs to be enhanced.
_______________________________________________
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss
