On Tue, Mar 22, 2016 at 9:31 AM, Kevin Benton <[email protected]> wrote:
> Thanks for doing this. I dug into the test_volume_boot_pattern test to see
> what was going on.
>
> On the first boot, Nova called Neutron to create a port at 23:29:44 and it
> took 441ms to return the port to Nova.[1]
> Nova then plugged the interface for that port into OVS a little over 6
> seconds later at 23:29:50.[2]
> The Neutron agent attempted to process this on the iteration at 23:29:52
> [3]; however, it didn't get the ofport populated from the OVSDB monitor... a
> bug![4] The Neutron agent did catch it on the next iteration two seconds
> later on a retry and notified the Neutron server at 23:29:54.[5]

Good work as usual, Kevin; I just approved the fix for this bug.
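The two-second gap between those agent iterations is the processing loop
interval Kevin mentions further down as tunable. For anyone who wants to
experiment with tightening it, the knob is the OVS agent's polling
interval; a minimal sketch, assuming the usual ML2/OVS agent config
layout, so check the file name and section on your release:

    # /etc/neutron/plugins/ml2/openvswitch_agent.ini (path varies by release)
    [agent]
    # Seconds between agent processing iterations; the default of 2 is
    # the gap visible between 23:29:52 and 23:29:54 in the agent log.
    polling_interval = 1

Lowering it trades extra agent/OVSDB churn for faster port wiring, so it
is worth measuring rather than assuming.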
> The Neutron server processed the port ACTIVE change in just under 80ms[6],
> but it did not dispatch the notification to Nova until 2 seconds later at
> 23:29:56 [7] due to the Nova notification batching mechanism[8].
>
> Total time between port create and boot is about 12 seconds: 6 in Nova and
> 6 in Neutron.
>
> For the Neutron side, the bug fix should eliminate 2 seconds. We could
> probably make the Nova notifier batching mechanism a little more aggressive
> so it only batches up calls in a very short interval rather than making
> 2-second buckets at all times. The remaining 2 seconds is just the agent
> processing loop interval, which can be tuned with a config, but it should
> be okay if that's the only bottleneck.
>
> For Nova, we need to improve that 6 seconds after it has created the
> Neutron port before it has plugged it into the vswitch. I can see it makes
> some other calls to Neutron in this time to list security groups and
> floating IPs. Maybe these can be done asynchronously, because I don't think
> they should block the initial VM boot phase that plugs in the VIF.
>
> Completely unrelated to the boot process, the entire tempest run spent ~412
> seconds building and destroying Neutron resources in setup and teardown.[9]
> However, considering the number of tests executed, this seems reasonable,
> so I'm not sure we need to work on optimizing that yet.
>
> Cheers,
> Kevin Benton
>
> 1. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_45_341
> 2. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-n-cpu.txt.gz#_2016-03-21_23_29_50_629
> 3. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_52_216
> 4. https://bugs.launchpad.net/neutron/+bug/1560464
> 5. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_54_738
> 6. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_54_813
> 7. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_56_782
> 8. http://git.openstack.org/cgit/openstack/neutron/tree/neutron/notifiers/nova.py
> 9. egrep -R 'tearDown|setUp' tempest.txt.gz | grep 9696 | awk '{print $(NF)}' | ./fsum
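A quick note on the batching mechanism in [8] for anyone who hasn't read
it: the delay Kevin observed at [7] is what you get from a fixed time
bucket, where even a lone event waits out the full interval before the
notification goes out. A minimal sketch of that pattern (my toy code,
not Neutron's actual implementation; the names BatchNotifier and
send_to_nova are made up):

    import threading

    class BatchNotifier(object):
        """Queue events and flush them to a callback in timed batches."""

        def __init__(self, batch_interval, callback):
            self.batch_interval = batch_interval  # the 2-second bucket
            self.callback = callback              # e.g. POST to Nova
            self.pending = []
            self._lock = threading.Lock()
            self._timer = None

        def queue_event(self, event):
            with self._lock:
                self.pending.append(event)
                if self._timer is None:
                    # The first event opens a bucket; anything queued in
                    # the next batch_interval seconds rides along in one
                    # notification.
                    self._timer = threading.Timer(self.batch_interval,
                                                  self._flush)
                    self._timer.start()

        def _flush(self):
            with self._lock:
                batch, self.pending = self.pending, []
                self._timer = None
            self.callback(batch)

    def send_to_nova(batch):
        print('notifying nova with %d event(s)' % len(batch))

    notifier = BatchNotifier(batch_interval=2, callback=send_to_nova)
    notifier.queue_event('port ACTIVE')  # a lone event still waits ~2s

Making the notifier "more aggressive", as suggested above, would amount
to shrinking that interval (or flushing early when the queue has been
idle), so a solitary port-ACTIVE notification isn't held for the full
bucket.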
> On Mon, Mar 21, 2016 at 5:09 PM, Clark Boylan <[email protected]> wrote:
>> On Mon, Mar 21, 2016, at 01:23 PM, Sean Dague wrote:
>> > On 03/21/2016 04:09 PM, Clark Boylan wrote:
>> > > On Mon, Mar 21, 2016, at 11:49 AM, Clark Boylan wrote:
>> > >> On Mon, Mar 21, 2016, at 11:08 AM, Clark Boylan wrote:
>> > >>> On Mon, Mar 21, 2016, at 09:32 AM, Armando M. wrote:
>> > >>>> Do you have a better insight into job runtimes vs jobs in other
>> > >>>> projects? Most of the time in the job runtime is actually spent
>> > >>>> setting the infrastructure up, and I am not sure we can do
>> > >>>> anything about it, unless we take this up with Infra.
>> > >>>
>> > >>> I haven't done a comparison yet, but let's break down the runtime
>> > >>> of a recent successful neutron full run against neutron master [0].
>> > >>
>> > >> And now for some comparative data from the gate-tempest-dsvm-full
>> > >> job [0]. This job also ran against a master change that merged and
>> > >> ran in the same cloud and region as the neutron job.
>> > >
>> > > snip
>> > >
>> > >> Generally each step of this job was quicker. There were big
>> > >> differences in devstack and tempest run time though. Is devstack
>> > >> much slower to set up neutron when compared to nova net? For
>> > >> tempest it looks like we run ~1510 tests against neutron and only
>> > >> ~1269 against nova net. This may account for the large difference
>> > >> there. I also recall that we run ipv6 tempest tests against neutron
>> > >> deployments that were inefficient and booted 2 qemu VMs per test
>> > >> (not sure if that is still the case, but it illustrates that the
>> > >> tests themselves may not be very quick in the neutron case).
>> > >
>> > > Looking at the tempest slowest tests output for each of these jobs
>> > > (neutron and nova net), some tests line up really well across jobs
>> > > and others do not. In order to get a better handle on the runtime
>> > > for individual tests I have pushed https://review.openstack.org/295487,
>> > > which will run tempest serially, reducing the competition for
>> > > resources between tests.
>> > >
>> > > Hopefully the subunit logs generated by this change can provide more
>> > > insight into where we are losing time during the tempest test runs.
>>
>> The results are in: we have gate-tempest-dsvm-full [0] and
>> gate-tempest-dsvm-neutron-full [1] job results where tempest ran
>> serially to reduce resource contention and provide accurate-ish
>> per-test timing data. Both of these jobs ran on the same cloud, so
>> they should have comparable performance from the underlying VMs.
>>
>> gate-tempest-dsvm-full
>> Time spent in job before tempest: 700 seconds
>> Time spent running tempest: 2428 seconds
>> Tempest tests run: 1269 (113 skipped)
>>
>> gate-tempest-dsvm-neutron-full
>> Time spent in job before tempest: 789 seconds
>> Time spent running tempest: 4407 seconds
>> Tempest tests run: 1510 (76 skipped)
>>
>> All times above are wall time as recorded by Jenkins.
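One derived figure worth pulling out of Clark's summary above: the
neutron job is not slower just because it runs more tests. A quick
back-of-the-envelope (my arithmetic, not something from the logs):

    # Average wall time per test from the numbers above.
    nova_net = 2428.0 / 1269   # ~1.91 seconds per test
    neutron = 4407.0 / 1510    # ~2.92 seconds per test
    print(neutron / nova_net)  # ~1.53x slower per test on average

So even normalized per test, the neutron run averages roughly 50% more
wall time, which lines up with the per-test numbers below.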
>> We can also compare the 10 slowest tests in the non-neutron job
>> against their runtimes in the neutron job. (Note this isn't a list of
>> the top 10 slowest tests in the neutron job, because that job runs
>> extra tests.)
>>
>> nova net job
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 85.232
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 83.319
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 50.338
>> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 43.494
>> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 40.225
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 39.653
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 37.720
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 36.355
>> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 27.375
>> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 27.025
>>
>> neutron job
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 110.345
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 108.170
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 63.852
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 59.931
>> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 57.835
>> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 49.552
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 40.378
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 39.088
>> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 35.645
>> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 30.551
>>
>> > Subunit logs aren't the full story here. Activity in addCleanup
>> > doesn't get added to the subunit time accounting for the test, which
>> > causes some interesting issues when waiting for resources to delete.
>> > I would be especially cautious of that on some of these.
>>
>> Based on this, those numbers above may not tell the whole story, but
>> they do seem to tell us that in comparable circumstances neutron is
>> slower than nova net. The sample size is tiny, but again it gives us
>> somewhere to start. What is boot from volume doing in the neutron case
>> that makes it so much slower? Why is shelving so much slower with
>> neutron? And so on.
>>
>> A few seconds here and a few seconds there add up when these
>> operations are repeated a few hundred times. We can probably start to
>> whittle the job runtime down by shaving that extra time off. In any
>> case, I think this is about as far as I can pull this thread, and I
>> will let the neutron team take it from here.
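To make Clark's two lists easier to eyeball, here is a throwaway sketch
that diffs them (short keys of my own invention; the times are exactly
the ones quoted above):

    # Per-test delta, neutron job minus nova net job, sorted worst-first.
    nova_net = {
        'volume_boot_pattern_v2': 85.232, 'volume_boot_pattern': 83.319,
        'shelve_volume_backed': 50.338, 'snapshot_pattern': 43.494,
        'minimum_basic': 40.225, 'shelve_instance': 39.653,
        'volume_backup_v1': 37.720, 'volume_backup_v2': 36.355,
        'resize_confirm_from_stopped': 27.375, 'encrypted_volumes_luks': 27.025,
    }
    neutron = {
        'volume_boot_pattern_v2': 110.345, 'volume_boot_pattern': 108.170,
        'shelve_volume_backed': 63.852, 'shelve_instance': 59.931,
        'snapshot_pattern': 57.835, 'minimum_basic': 49.552,
        'volume_backup_v1': 40.378, 'volume_backup_v2': 39.088,
        'encrypted_volumes_luks': 35.645, 'resize_confirm_from_stopped': 30.551,
    }
    for name in sorted(nova_net, key=lambda n: neutron[n] - nova_net[n],
                       reverse=True):
        print('%-28s +%5.1fs (%.2fx)' % (name, neutron[name] - nova_net[name],
                                         neutron[name] / nova_net[name]))

All ten are slower under neutron. The two boot-from-volume scenarios are
the worst in absolute terms (about +25 seconds each), while
test_shelve_instance shows the largest relative slowdown (~1.5x), so
those look like the right places to start digging.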
>> [0] http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-full/8e64615/console.html
>> [1] http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/console.html
