On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,
I've uploaded a prototype, https://review.openstack.org/#/c/280047/,
to demonstrate its design goals: improvements in accuracy,
performance, reliability and compatibility.
It will also be an Austin Summit session if selected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
I want to gather opinions about this idea:
1. Could this feature be accepted in the Newton release?
Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
Ideally, I'd like to see the ideas below written in that spec file, as
that would be the best way to discuss the design.
2. Suggestions to improve its design and compatibility.
I don't want to go into details here (that's rather what the spec is
for), but when reviewing the spec my biggest concerns would be:
- how this can meet the OpenStack mission statement (i.e. a ubiquitous
solution that is easy to install and massively scalable)
- how this can be integrated with the existing filters and weighers to
provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending
updates to a new scheduler)
- how we can test it
- whether the feature can be optional for operators
3. Possibilities to integrate with the resource-provider bp series: I
know resource-provider is the major direction of the Nova scheduler,
and there will be fundamental changes in the future, especially
according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way to
make scheduling decisions based on scheduler caches. The in-memory
decisions are made at the same speed as in the caching scheduler, but
the caches are kept consistent with the compute nodes as quickly as
possible, without refreshing from the db.
That's the key point, thanks for noticing our priorities. So, you know
that our resource modeling is subject to drastic change in Mitaka and
Newton. That is the new game, so I'd love to see how you plan to
interact with that.
Ideally, I'd appreciate if Jay Pipes, Chris Dent and you could share
your ideas because all of you are having great ideas to improve a
current frustrating solution.
-Sylvain
Here is the detailed design of the mentioned prototype:
>>----------------------------
Background:
The host state cache maintained by the host manager is the scheduler's
resource view during scheduling decision making. It is updated whenever
a request is received[1], and all the compute node records are
retrieved from the db every time. There are several problems with this
update model, proven in experiments[3]:
1. Performance: Scheduler performance is largely affected by the db
access needed to retrieve compute node records. The db block time of a
single request is 355ms on average in a deployment of 3 compute
nodes, compared with only 3ms for the in-memory decision-making.
Imagine there could be 1k, or even 10k, nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but
also a problem when using only one scheduler. The detailed analysis of
the one-scheduler problem is in the bug analysis[2]. In short, there
is a gap between the moment the scheduler makes a decision in its host
state cache and the moment the compute node updates its in-db resource
record according to that decision in the resource tracker. Because of
this gap, a recent resource consumption recorded in the scheduler
cache can be lost and overwritten by compute node data, resulting in
cache inconsistency and unexpected retries (a minimal sketch of this
race follows this list). In a one-scheduler experiment using a 3-node
deployment, 7 retries out of 31 concurrent schedule requests were
recorded, resulting in 22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler leads
to an "even worse" performance result when using parallel schedulers.
In the same experiment with 4 schedulers on separate machines, the
average db block time increases to 697ms per request and there are 16
retries out of 31 schedule requests, i.e. 51.6% extra overhead.
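To make the "false overwrite" in problem 2 concrete, here is a minimal,
self-contained sketch (illustrative only, not actual Nova code; the
HostState shape and the numbers are simplified) of how a db refresh can
erase a consumption the scheduler has already recorded in its cache:

# Minimal illustration of the "false overwrite" race in problem 2.
# The class and the numbers are illustrative, not the real Nova code.

class HostState(object):
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

def schedule(cache, flavor_ram_mb):
    # The scheduler consumes resources in its in-memory cache only.
    cache.free_ram_mb -= flavor_ram_mb

def refresh_from_db(cache, db_record):
    # The db-driven refresh blindly overwrites the cached view.
    cache.free_ram_mb = db_record["free_ram_mb"]

cache = HostState(free_ram_mb=2048)
db_record = {"free_ram_mb": 2048}    # compute node has not claimed yet

schedule(cache, flavor_ram_mb=1024)  # cache now says 1024 MB free
refresh_from_db(cache, db_record)    # ...but the db still says 2048 MB
print(cache.free_ram_mb)             # 2048: the consumption is lost

# The next request can then pick this host again, fail the claim in the
# resource tracker, and retry -- the source of the 22.6% (7/31) and
# 51.6% (16/31) overhead measured above.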
Improvements:
This prototype solves the issues above by implementing a new update
model for the scheduler host state cache. Instead of refreshing caches
from the db, every compute node maintains its own accurate version of
the host state cache, updated by the resource tracker, and sends
incremental updates directly to the schedulers. The scheduler caches
are thus synchronized to the correct state as soon as possible, with
the lowest overhead. The scheduler also sends a resource claim along
with its decision to the target compute node. The compute node can
decide immediately, from its local host state cache, whether the claim
is successful, and send the response back ASAP (a rough sketch of this
flow follows the list below). With all claims tracked from schedulers
to compute nodes, no false overwrites happen, and the gaps between the
scheduler caches and the real compute node states are minimized. The
benefits are obvious in the recorded experiments[3] compared with the
caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the
average decision time per request is about 3ms in both the single- and
multiple-scheduler scenarios, which is equal to the in-memory decision
time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite" is
eliminated, there should be 0 retries in a one-scheduler deployment, as
proven in the experiment. Thanks to the quick claim-response
implementation, there are only 2 retries out of 31 requests in the
4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because
the data structure of HostState is unchanged. In fact, this prototype
even supports running the filter scheduler at the same time (already
tested). Other operations with resource changes, such as migration,
resizing or shelving, make claims in the resource tracker directly and
update the compute node host state immediately, without major changes.
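Here is a rough sketch of the update/claim flow described above, with
simplified stand-ins for the scheduler-side cache, the compute-side
host state and the casts (illustrative only; the prototype's actual
classes and RPC interfaces differ):

# Rough sketch of the proposed flow: compute nodes push incremental
# updates, the scheduler claims locally and casts the claim to the
# target node, and the node confirms or rejects it against its own
# authoritative host state. Names are illustrative, not the real API.

class ComputeHostState(object):
    # Authoritative per-node view, maintained by the resource tracker.
    def __init__(self, host, free_ram_mb):
        self.host = host
        self.free_ram_mb = free_ram_mb

    def try_claim(self, ram_mb):
        # Decide immediately from the local cache, no db round trip.
        if self.free_ram_mb >= ram_mb:
            self.free_ram_mb -= ram_mb
            return True
        return False

class SchedulerCache(object):
    # Scheduler-side replica, fed by incremental updates from nodes.
    def __init__(self):
        self.hosts = {}

    def apply_update(self, host, free_ram_mb):
        # Incremental update cast from a compute node.
        self.hosts[host] = free_ram_mb

    def claim(self, host, ram_mb):
        # Consume in the replica right away so parallel requests see it.
        self.hosts[host] -= ram_mb

node = ComputeHostState("node1", free_ram_mb=4096)
cache = SchedulerCache()
cache.apply_update(node.host, node.free_ram_mb)   # update cast

cache.claim("node1", ram_mb=1024)                 # scheduler-side claim
if node.try_claim(1024):                          # claim cast to node
    print("claim confirmed, no retry needed")
else:
    cache.apply_update(node.host, node.free_ram_mb)  # refresh and retry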
Extra features:
More effort has been made to adapt the implementation to real-world
scenarios, such as network issues, services unexpectedly going down,
and overwhelming messages:
1. The communication between schedulers and compute nodes uses only
casts; there are no RPC calls and thus no blocking during scheduling.
2. All updates from nodes to schedulers are labelled with an
incremental seed, so any message reordering, loss or duplication due
to network issues is detected by the MessageWindow immediately (see
the first sketch after this list). An inconsistent cache can then be
detected and refreshed correctly.
3. Overwhelming messages are compressed by the MessagePipe in its
async mode (see the second sketch after this list). There is no need
to send all the messages one by one over the MQ; they can be merged
before being sent to the schedulers.
4. When a new service comes up or recovers, it sends notifications to
all known remotes for quick cache synchronization, even before its
service record is available in the db. And if a remote service is
unexpectedly down according to the service group records, no more
messages are sent to it. The ComputeFilter is also removed because of
this feature; the scheduler can detect remote compute nodes by itself.
5. In fact, claims are tracked not only from the schedulers to the
compute nodes, but also from the compute-node host state to the
resource tracker. One reason is that there is still a gap between a
claim being acknowledged by the compute-node host state and the claim
succeeding in the resource tracker; it is necessary to track those
unhandled claims to keep the host state accurate. The second reason is
to separate the schedulers from the compute node and resource tracker.
The scheduler only exports the limited interfaces
`update_from_compute` and `handle_rt_claim_failure` to the compute
service and the RT, so testing and reuse are easier with clear
boundaries.
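As a rough illustration of the seed-based detection in point 2 above
(a sketch only; the prototype's MessageWindow may handle out-of-order
windows rather than the strict ordering shown here), the receiver can
keep the next seed it expects and treat any gap as a signal to refresh
the cache:

# Sketch of seed-based update tracking (point 2). Illustrative only.

class MessageWindow(object):
    def __init__(self):
        self.expected = 0

    def check(self, seed):
        # Classify an incoming update by its seed.
        if seed == self.expected:
            self.expected += 1
            return "apply"    # in order: apply the incremental update
        if seed < self.expected:
            return "ignore"   # duplicate: already applied, drop it
        return "refresh"      # gap: a message was lost or reordered

window = MessageWindow()
for seed in (0, 1, 1, 3):     # seed 1 duplicated, seed 2 lost
    print(seed, window.check(seed))
# -> apply, apply, ignore, refresh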
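And for point 3, a minimal sketch of the merging idea behind the
MessagePipe (again illustrative; the real implementation works
asynchronously on top of the messaging layer):

# Sketch of update merging (point 3). Pending deltas for the same host
# are coalesced and flushed as one message instead of one cast per
# change. Illustrative only; the async MessagePipe differs in detail.

class MessagePipe(object):
    def __init__(self, send):
        self.send = send      # callable that actually casts to schedulers
        self.pending = {}

    def push(self, host, **deltas):
        # Merge the new delta into whatever is queued for this host.
        merged = self.pending.setdefault(host, {})
        for key, value in deltas.items():
            merged[key] = merged.get(key, 0) + value

    def flush(self):
        # One cast per host instead of one cast per resource change.
        for host, deltas in self.pending.items():
            self.send(host, deltas)
        self.pending.clear()

pipe = MessagePipe(send=lambda host, deltas: print(host, deltas))
pipe.push("node1", free_ram_mb=-1024)
pipe.push("node1", free_ram_mb=-2048, vcpus_used=1)
pipe.flush()   # node1 {'free_ram_mb': -3072, 'vcpus_used': 1}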
TODOs:
There are still many features to be implemented; the most important
are unit tests and incremental updates to PCI and NUMA resources, all
of which are marked inline.
References:
[1]
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<
The original commit history of this prototype is located at
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions on how to install and test this prototype, please
refer to the commit message of https://review.openstack.org/#/c/280047/
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev