Thanks Boris. The idea is quite similar to "Do not have db accesses during 
scheduler decision making": db accesses block the decision making itself, 
which is very bad for the lock-free design of the nova scheduler.

Another important idea is that "only the compute node knows its own final 
compute-node resource view", or put another way, "the accurate resource view 
only exists at the place where it is actually consumed." That is, the 
incremental updates can only come from the actual "consumption" action, no 
matter where it happens (e.g. compute node, storage service, network service, 
etc.). To borrow the terms from resource-provider: a compute node can maintain 
an accurate "compute-node-inventory" cache and send incremental updates, 
because it is the one actually consuming compute resources; likewise, a 
storage service can maintain an accurate "storage-inventory" cache and send 
incremental updates if it is the one consuming storage resources. If there is 
a central service in charge of consuming all the resources, the accurate cache 
and updates must come from it.
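
As an illustration, here is a minimal sketch of what such a consumption-driven 
incremental update could look like. The names (InventoryUpdate, 
cast_to_schedulers) are hypothetical, not the prototype's actual classes:

    # Hypothetical sketch: the service that actually consumes the
    # resources emits a delta instead of a full inventory record.
    class InventoryUpdate(object):
        def __init__(self, node_id, seed, deltas):
            self.node_id = node_id  # which inventory this delta belongs to
            self.seed = seed        # monotonically increasing sequence number
            self.deltas = deltas    # e.g. {'vcpus_used': 2,
                                    #       'memory_mb_used': 4096}

    def on_instance_claimed(tracker, flavor):
        # Only the consumer (here, the compute-node resource tracker)
        # knows the authoritative post-claim state, so only it may
        # emit the update.
        tracker.seed += 1
        update = InventoryUpdate(
            node_id=tracker.node_id,
            seed=tracker.seed,
            deltas={'vcpus_used': flavor.vcpus,
                    'memory_mb_used': flavor.memory_mb})
        tracker.cast_to_schedulers(update)  # async, non-blocking

A storage or network service would emit the same kind of delta for the 
inventories it consumes.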

The third idea is "compatibility". This prototype keeps a very small scope: it 
only introduces a new host_manager driver, "shared_host_manager", with minor 
other changes. The driver can be switched back to "host_manager" very easily, 
and it can run alongside filter schedulers and caching schedulers. Most 
importantly, the filtering and weighing algorithms are kept unchanged. This 
way, the complete version of the "shared state scheduler" can evolve 
gradually, with more changes introduced step by step.
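
Assuming the prototype registers the driver under the name above, switching it 
on (and back off) would be a one-line config change on the scheduler host, 
roughly like this (the exact option value depends on how the driver is 
registered):

    [DEFAULT]
    # use the prototype's driver
    scheduler_host_manager = shared_host_manager
    # or revert to the default:
    # scheduler_host_manager = host_manager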


Regards,
-Yingxin

From: Boris Pavlovic [mailto:bo...@pavlovic.me]
Sent: Monday, February 15, 2016 1:59 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"

Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

It's really nice that somebody is still trying to push scheduler refactoring in 
this way.
Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin 
<yingxin.ch...@intel.com> wrote:
Hi,

I’ve uploaded a prototype, https://review.openstack.org/#/c/280047/ , to 
demonstrate its design goals in accuracy, performance, reliability and 
compatibility improvements. It will also be an Austin Summit session if 
elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is this feature possible to be accepted in the Newton release?
2. Suggestions to improve its design and compatibility.
3. Possibilities to integrate with the resource-provider bp series: I know 
resource-provider is the major direction of the Nova scheduler, and there will 
be fundamental changes in the future, especially according to the bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
 However, this prototype proposes a much faster and compatible way to make 
scheduling decisions based on scheduler caches. The in-memory decisions are 
made at the same speed as the caching scheduler, but the caches are kept 
consistent with compute nodes as quickly as possible, without db refreshing.

Here is the detailed design of the mentioned prototype:

>>----------------------------
Background:
The host state cache maintained by the host manager is the scheduler's 
resource view during decision making. It is updated whenever a request is 
received[1], and all compute node records are retrieved from the db every 
time. There are several problems with this update model, as proven in 
experiments[3]:
1. Performance: Scheduler performance is largely affected by the db accesses 
that retrieve compute node records. The db block time of a single request 
averages 355ms in a deployment with only 3 compute nodes, compared with only 
3ms for the in-memory decision making itself. Imagine deployments with 1k, or 
even 10k nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem; it exists 
even with a single scheduler. The detailed analysis of the one-scheduler 
problem is in the bug analysis[2]. In short, there is a gap between the moment 
the scheduler makes a decision in the host state cache and the moment the 
compute node updates its in-db resource record according to that decision in 
the resource tracker. A recent scheduler resource consumption in the cache can 
be lost and overwritten by compute node data because of it, resulting in cache 
inconsistency and unexpected retries (a schematic of this race is sketched 
after this list). In a one-scheduler experiment with a 3-node deployment, 7 
retries out of 31 concurrent schedule requests were recorded, i.e. 22.6% extra 
performance overhead.
3. Parallel scheduler support: The design of the filter scheduler leads to an 
even worse result with parallel schedulers. In the same experiment with 4 
schedulers on separate machines, the average db block time increased to 697ms 
per request, and there were 16 retries out of 31 schedule requests, i.e. 51.6% 
extra overhead.
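
To make the "false overwrite" in point 2 concrete, here is a minimal 
simulation of the race (field names are illustrative):

    # The scheduler consumes resources in its cache, but the compute
    # node has not yet persisted the claim to the db.
    cache = {'free_ram_mb': 2048}
    db_row = {'free_ram_mb': 2048}   # stale: claim not yet written

    # t1: scheduler places a 1024MB instance and consumes from cache
    cache['free_ram_mb'] -= 1024     # cache now 1024

    # t2: the next request triggers a cache refresh from the db
    # before the resource tracker has written the claim
    cache['free_ram_mb'] = db_row['free_ram_mb']

    # t3: the consumption at t1 is lost; the scheduler sees 2048 free
    # again, over-commits the node, and the claim later fails in the
    # resource tracker, causing a retry
    assert cache['free_ram_mb'] == 2048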

Improvements:
This prototype solves the issues above by implementing a new update model for 
the scheduler host state cache. Instead of refreshing caches from the db, 
every compute node maintains its own accurate version of the host state cache, 
updated by the resource tracker, and sends incremental updates directly to the 
schedulers. So the scheduler caches are synchronized to the correct state as 
soon as possible, with the lowest overhead. The scheduler also sends a 
resource claim with its decision to the target compute node. The compute node 
can decide immediately, from its local host state cache, whether the claim is 
successful, and sends the response back ASAP (a sketch of this claim handling 
follows the list below). With all claims tracked from schedulers to compute 
nodes, no false overwrites can happen, and the gap between the scheduler 
caches and the real compute node states is minimized. The benefits are obvious 
in the recorded experiments[3], compared with the caching scheduler and the 
filter scheduler:
1. There is no db block time during scheduler decision making; the average 
decision time per request is about 3ms in both the single- and 
multiple-scheduler scenarios, equal to the in-memory decision time of the 
filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite" is 
eliminated, there should be 0 retries in a one-scheduler deployment, as proven 
in the experiment. Thanks to the quick claim responses, there are only 2 
retries out of 31 requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible, because the data 
structure of HostState is unchanged. In fact, this prototype even supports a 
filter scheduler running at the same time (already tested). Other operations 
with resource changes, such as migration, resizing or shelving, make claims in 
the resource tracker directly and update the compute node host state 
immediately, without major changes.
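
Here is a rough sketch of how the claim handling on the compute side can work 
without any db access. The names handle_claim and cast_claim_reply are 
hypothetical, not the prototype's exact interfaces:

    def handle_claim(host_state, claim):
        # Decide against the local, always-accurate host state cache.
        if (host_state.free_ram_mb >= claim.memory_mb and
                host_state.vcpus_total - host_state.vcpus_used
                    >= claim.vcpus):
            # Accept: consume locally, then confirm to the scheduler
            # so its tracked claim is marked successful.
            host_state.free_ram_mb -= claim.memory_mb
            host_state.vcpus_used += claim.vcpus
            cast_claim_reply(claim.scheduler, claim.id, success=True)
        else:
            # Reject: the scheduler rolls back its cached consumption
            # and retries elsewhere, instead of silently diverging.
            cast_claim_reply(claim.scheduler, claim.id, success=False)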

Extra features:
More effort was made to adapt the implementation to real-world scenarios, such 
as network issues, services unexpectedly going down, and overwhelming message 
rates:
1. The communication between schedulers and compute nodes uses only casts; 
there are no RPC calls, thus no blocking during scheduling.
2. All updates from nodes to schedulers are labelled with an incremental seed, 
so any message reordering, loss or duplication due to network issues can be 
detected by MessageWindow immediately (see the sketch after this list). The 
inconsistent cache can then be detected and refreshed correctly.
3. Overwhelming messages are compressed by MessagePipe in its async mode. 
There is no need to send all the messages one by one through the MQ; they can 
be merged before being sent to the schedulers.
4. When a new service comes up or recovers, it sends notifications to all 
known remotes for quick cache synchronization, even before its service record 
is available in the db. And if a remote service goes down unexpectedly 
according to the service group records, no more messages will be sent to it. 
The ComputeFilter is also removed thanks to this feature: the scheduler can 
detect remote compute nodes by itself.
5. In fact, claim tracking happens not only from schedulers to compute nodes, 
but also from the compute-node host state to the resource tracker. One reason 
is that there is still a gap between a claim being acknowledged by the 
compute-node host state and that claim succeeding in the resource tracker; it 
is necessary to track those unhandled claims to keep the host state accurate. 
The second reason is to decouple the schedulers from the compute node and the 
resource tracker: the scheduler only exports the limited interfaces 
`update_from_compute` and `handle_rt_claim_failure` to the compute service and 
the RT, so testing and reuse are easier, with clear boundaries.
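
For point 2 above, the seed checking behind MessageWindow can be as simple as 
the following sketch (the prototype's real implementation may differ):

    class MessageWindow(object):
        def __init__(self):
            self.expected = 1   # seed of the next in-order update

        def check(self, seed):
            if seed < self.expected:
                # duplicate or reordered message: already applied
                return 'drop'
            if seed > self.expected:
                # an update was lost: the cache is inconsistent,
                # request a full refresh from the sender
                return 'refresh'
            self.expected += 1
            return 'apply'      # in-order incremental update

Every incremental update carries its sender's seed, so a scheduler can tell a 
harmless duplicate from a genuinely lost message, and only falls back to a 
full refresh in the latter case.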

TODOs:
There are still many features to be implemented; the most important are unit 
tests and incremental updates for PCI and NUMA resources. All of them are 
marked inline.

References:
[1] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<

The original commit history of this prototype is located at 
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions to install and test this prototype, please refer to the commit 
message of https://review.openstack.org/#/c/280047/


Regards,
-Yingxin



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
