On 15 December 2015 at 10:03, Nikola Đipanov <[email protected]> wrote:
> On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
>>
>>> -----Original Message-----
>>> From: Nikola Đipanov [mailto:[email protected]]
>>> Sent: Monday, December 14, 2015 11:11 PM
>>> To: OpenStack Development Mailing List (not for usage questions)
>>> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler
>>> (esp. race conditions)?
>>>
>>> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
>>>> Hi All,
>>>>
>>>> When I was looking at bugs related to race conditions in the
>>>> scheduler [1-3], it felt like the nova scheduler lacks sanity checks
>>>> of scheduling decisions in different situations. We cannot even make
>>>> sure that a given fix mitigates race conditions to an acceptable
>>>> degree. For example, there is no easy way to test whether server-group
>>>> race conditions still exist after a fix for bug [1], to make sure
>>>> that after scheduling there are no violations of allocation ratios as
>>>> reported in bug [2], or to check that the retry rate is acceptable in
>>>> the corner cases described in bug [3]. And there is much more on
>>>> this list.
>>>>
>>>> So I'm asking whether there is a plan to add those tests in the
>>>> future, or whether a design exists to simplify writing and executing
>>>> those kinds of tests. I'm thinking of using fake databases and fake
>>>> interfaces to isolate the entire scheduler service, so that we can
>>>> easily build up a disposable environment with all kinds of fake
>>>> resources and fake compute nodes to test scheduler behaviour. It is
>>>> even a good way to test whether the scheduler is capable of scaling
>>>> to 10k nodes without setting up 10k real compute nodes.
>>>
>>> This would be a useful effort - however, do not assume that it is
>>> going to be an easy task. Even in the paragraph above, you fail to
>>> take into account that in order to test the scheduling you also need
>>> to run all the compute services, since claims work like a kind of
>>> two-phase commit where a scheduling decision gets checked on the
>>> destination compute host (through the Claims logic), which involves
>>> locking in each compute process.
>>
>> Yes, the final goal is to test the entire scheduling process,
>> including the two-phase commit. As the scheduler is still in the
>> process of being decoupled, some parts such as the resource tracker
>> and the retry mechanism are highly coupled with nova, so IMO it is not
>> a good idea to include them at this stage. I'll therefore try to
>> isolate the filter scheduler as a first step, and hope to be supported
>> by the community.
>>
>>>> I'm also interested in the blueprint [4] to reduce scheduler race
>>>> conditions at the green-thread level. I think it is a good starting
>>>> point for solving the huge racing problem of the nova scheduler, and
>>>> I really wish I could help on that.
>>>
>>> I proposed said blueprint but am very unlikely to have any time to
>>> work on it this cycle, so feel free to take a stab at it. I'd be more
>>> than happy to prioritize any reviews related to the above BP.
>>>
>>> Thanks for your interest in this
>>>
>>> N.
>>
>> Many thanks, Nikola! I'm still looking at the claim logic and trying
>> to find a way to merge it with the scheduler host state; I will upload
>> patches as soon as I figure it out.
>
> Great!
>
> Note that that step is not necessary - and indeed it may not be the
> best place to start.
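To make the fake-environment idea and the two-phase-commit point above
concrete, here is a minimal, self-contained sketch in Python - all names
are invented for illustration, and none of this is actual nova code. The
first phase picks a host from (possibly stale) host state; the second
phase re-validates the decision under the node's own lock, the way
compute-side claims abort and trigger a retry:

    import threading

    class FakeComputeNode:
        def __init__(self, name, total_ram_mb, ram_allocation_ratio=1.0):
            self.name = name
            self.limit_mb = total_ram_mb * ram_allocation_ratio
            self.used_mb = 0
            self._lock = threading.Lock()  # stands in for compute-side locking

        def claim(self, ram_mb):
            # Second phase: re-check the decision under the node's lock.
            with self._lock:
                if self.used_mb + ram_mb > self.limit_mb:
                    return False  # claim failed -> the scheduler would retry
                self.used_mb += ram_mb
                return True

    def schedule(nodes, ram_mb):
        # First phase: pick a host based on (possibly stale) host state.
        for node in sorted(nodes, key=lambda n: n.used_mb):
            if node.claim(ram_mb):
                return node
        raise RuntimeError("NoValidHost")

    # A disposable environment: 1000 fake nodes, no real compute services.
    nodes = [FakeComputeNode("fake-%d" % i, 1024) for i in range(1000)]
    host = schedule(nodes, ram_mb=512)
    # The kind of sanity check the thread asks for: no allocation-ratio
    # violations regardless of how many requests were scheduled.
    assert all(n.used_mb <= n.limit_mb for n in nodes)

Driving schedule() from many concurrent threads and keeping that final
assert green is exactly the sort of reproducible race-condition test
that bugs [1]-[3] currently lack.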
> We already have code duplication between the claims and (what has only
> recently been renamed) consume_from_request, so removing it is a nice
> to have, but not directly related to fixing the races.
>
> Also, after Sylvain's work here https://review.openstack.org/#/c/191251/
> it will be trickier to do, as the scheduler side now uses the
> RequestSpec object instead of Instance, which is not sent over to the
> compute nodes.
>
> I'd personally leave that for last.
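For readers new to this part of the code, the duplication mentioned
above is roughly of the following shape - a hypothetical sketch, not the
actual nova code: the scheduler-side consume_from_request and the
compute-side claim both apply the same resource arithmetic, so the
shared piece could in principle live in one helper:

    def apply_usage(usage, flavor):
        # The arithmetic both sides currently duplicate.
        usage["ram_mb"] += flavor["ram_mb"]
        usage["vcpus"] += flavor["vcpus"]
        usage["disk_gb"] += flavor["disk_gb"]

    class HostState:  # scheduler side
        def __init__(self):
            self.usage = {"ram_mb": 0, "vcpus": 0, "disk_gb": 0}

        def consume_from_request(self, spec):
            apply_usage(self.usage, spec["flavor"])

    class Claim:  # compute side
        def __init__(self, tracker_usage, instance_flavor):
            apply_usage(tracker_usage, instance_flavor)

As noted above, though, the RequestSpec/Instance split makes sharing
trickier than this sketch suggests, since the two sides no longer see
the same object.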
I would recommend you attend the scheduler sub-team meetings, if at all
possible, or track what is discussed there:
http://eavesdrop.openstack.org/#Nova_Scheduler_Team_Meeting

There is a rough outline of the current direction of the scheduler work:
http://docs.openstack.org/developer/nova/scheduler_evolution.html

As ever, that's a little out of date right now, and doesn't capture all
the discussions around moving claims into the scheduler.

Thanks,
johnthetubaguy

> N.
>
>>>> [1] https://bugs.launchpad.net/nova/+bug/1423648
>>>> [2] https://bugs.launchpad.net/nova/+bug/1370207
>>>> [3] https://bugs.launchpad.net/nova/+bug/1341420
>>>> [4] https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
>>>>
>>>> Regards,
>>>> -Yingxin
>>
>> Regards,
>> -Yingxin
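For reference, the green-thread-level race that the host-state-level-locking
blueprint [4] targets can be sketched as follows - hypothetical code, not
the blueprint's actual implementation: without per-host synchronization,
two scheduling green threads can both read enough free RAM on the same
host state and both consume it:

    import threading

    class LockedHostState:
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb
            self._lock = threading.Lock()

        def try_consume(self, ram_mb):
            # Serialize test-and-consume so two concurrent scheduling
            # threads cannot both fit into the last remaining slot.
            with self._lock:
                if self.free_ram_mb < ram_mb:
                    return False
                self.free_ram_mb -= ram_mb
                return True

Without the lock, the check and the decrement can interleave across
green threads, which is the race the blueprint sets out to remove.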
