On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,
I've uploaded a prototype, https://review.openstack.org/#/c/280047/,
to demonstrate its design goals: improvements in accuracy,
performance, reliability and compatibility.
It will also be an Austin Summit session if selected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
I want to gather opinions about this idea:
1. Could this feature be accepted in the Newton release?
Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
Ideally, I'd like to see the ideas below written in that spec file, as
that would be the best way to discuss the design.
2. Suggestions to improve its design and compatibility.
I don't want to go into details here (that's rather what the spec is
for), but when reviewing the spec my biggest concerns would be:
- how this can meet the OpenStack mission statement (i.e. a ubiquitous
solution that is easy to install and massively scalable)
- how this can be integrated with the existing filters and weighers to
provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending
updates to a new scheduler)
- how we can test it
- whether the feature can be optional for operators
3. Possibilities to integrate with the resource-provider bp series: I
know resource-provider is the major direction of the Nova scheduler,
and there will be fundamental changes in the future, especially
according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way to
make scheduling decisions based on scheduler caches. The in-memory
decisions are made at the same speed as in the caching scheduler, but
the caches are kept consistent with the compute nodes as quickly as
possible, without refreshing from the db.
That's the key point, thanks for noticing our priorities. So, you know
that our resource modeling is subject to drastic change in Mitaka and
Newton. That is the new game, so I'd love to see how you plan to
interact with that.
Ideally, I'd appreciate if Jay Pipes, Chris Dent and you could share
your ideas because all of you are having great ideas to improve a
current frustrating solution.
-Sylvain
Here is the detailed design of the mentioned prototype:
>>----------------------------
Background:
The host state cache maintained by the host manager is the scheduler's
resource view during scheduling decision making. It is updated whenever
a request is received[1], and all the compute node records are
retrieved from the db every time. There are several problems with this
update model, proven in experiments[3]:
1. Performance: Scheduler performance is largely affected by the db
access needed to retrieve compute node records. The db block time of a
single request is 355ms on average in a deployment of 3 compute
nodes, compared with only 3ms for the in-memory decision-making.
Imagine there could be 1k, or even 10k, nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but
also a problem when using only one scheduler. The detailed analysis of
the one-scheduler problem is in the bug analysis[2]. In short, there
is a gap between the moment the scheduler makes a decision in its host
state cache and the moment the compute node updates its in-db resource
record according to that decision in the resource tracker. Because of
this gap, a recent resource consumption recorded in the scheduler
cache can be lost and overwritten by compute node data, resulting in
cache inconsistency and unexpected retries (a minimal sketch of this
race follows this list). In a one-scheduler experiment using a 3-node
deployment, 7 retries out of 31 concurrent schedule requests were
recorded, resulting in 22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler leads
to an "even worse" performance result when using parallel schedulers.
In the same experiment with 4 schedulers on separate machines, the
average db block time increases to 697ms per request and there are 16
retries out of 31 schedule requests, i.e. 51.6% extra overhead.
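To make the "false overwrite" in problem 2 concrete, here is a minimal,
self-contained sketch (illustrative only, not actual Nova code; the
HostState shape and the numbers are simplified) of how a db refresh can
erase a consumption the scheduler has already recorded in its cache:

# Minimal illustration of the "false overwrite" race in problem 2.
# The class and the numbers are illustrative, not the real Nova code.

class HostState(object):
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

def schedule(cache, flavor_ram_mb):
    # The scheduler consumes resources in its in-memory cache only.
    cache.free_ram_mb -= flavor_ram_mb

def refresh_from_db(cache, db_record):
    # The db-driven refresh blindly overwrites the cached view.
    cache.free_ram_mb = db_record["free_ram_mb"]

cache = HostState(free_ram_mb=2048)
db_record = {"free_ram_mb": 2048}    # compute node has not claimed yet

schedule(cache, flavor_ram_mb=1024)  # cache now says 1024 MB free
refresh_from_db(cache, db_record)    # ...but the db still says 2048 MB
print(cache.free_ram_mb)             # 2048: the consumption is lost

# The next request can then pick this host again, fail the claim in the
# resource tracker, and retry -- the source of the 22.6% (7/31) and
# 51.6% (16/31) overhead measured above.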
Improvements:
This prototype solves the issues above by implementing a new update
model for the scheduler host state cache. Instead of refreshing caches
from the db, every compute node maintains its own accurate version of
the host state cache, updated by the resource tracker, and sends
incremental updates directly to the schedulers. The scheduler caches
are thus synchronized to the correct state as soon as possible, with
the lowest overhead. The scheduler also sends a resource claim along
with its decision to the target compute node. The compute node can
decide immediately, from its local host state cache, whether the claim
is successful, and send the response back ASAP (a rough sketch of this
flow follows the list below). With all claims tracked from schedulers
to compute nodes, no false overwrites happen, and the gaps between the
scheduler caches and the real compute node states are minimized. The
benefits are obvious in the recorded experiments[3] compared with the
caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the
average decision time per request is about 3ms in both the single- and
multiple-scheduler scenarios, which is equal to the in-memory decision
time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite" is
eliminated, there should be 0 retries in a one-scheduler deployment, as
proven in the experiment. Thanks to the quick claim-response
implementation, there are only 2 retries out of 31 requests in the
4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because
the data structure of HostState is unchanged. In fact, this prototype
even supports running the filter scheduler at the same time (already
tested). Other operations with resource changes, such as migration,
resizing or shelving, make claims in the resource tracker directly and
update the compute node host state immediately, without major changes.
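Here is a rough sketch of the update/claim flow described above, with
simplified stand-ins for the scheduler-side cache, the compute-side
host state and the casts (illustrative only; the prototype's actual
classes and RPC interfaces differ):

# Rough sketch of the proposed flow: compute nodes push incremental
# updates, the scheduler claims locally and casts the claim to the
# target node, and the node confirms or rejects it against its own
# authoritative host state. Names are illustrative, not the real API.

class ComputeHostState(object):
    # Authoritative per-node view, maintained by the resource tracker.
    def __init__(self, host, free_ram_mb):
        self.host = host
        self.free_ram_mb = free_ram_mb

    def try_claim(self, ram_mb):
        # Decide immediately from the local cache, no db round trip.
        if self.free_ram_mb >= ram_mb:
            self.free_ram_mb -= ram_mb
            return True
        return False

class SchedulerCache(object):
    # Scheduler-side replica, fed by incremental updates from nodes.
    def __init__(self):
        self.hosts = {}

    def apply_update(self, host, free_ram_mb):
        # Incremental update cast from a compute node.
        self.hosts[host] = free_ram_mb

    def claim(self, host, ram_mb):
        # Consume in the replica right away so parallel requests see it.
        self.hosts[host] -= ram_mb

node = ComputeHostState("node1", free_ram_mb=4096)
cache = SchedulerCache()
cache.apply_update(node.host, node.free_ram_mb)   # update cast

cache.claim("node1", ram_mb=1024)                 # scheduler-side claim
if node.try_claim(1024):                          # claim cast to node
    print("claim confirmed, no retry needed")
else:
    cache.apply_update(node.host, node.free_ram_mb)  # refresh and retry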
Extra features:
More effort has been made to adapt the implementation to real-world
scenarios, such as network issues, services unexpectedly going down,
and overwhelming messages:
1. The communication between schedulers and compute nodes uses only
casts; there are no RPC calls and thus no blocking during scheduling.
2. All updates from nodes to schedulers are labelled with an
incremental seed, so any message reordering, loss or duplication due
to network issues is detected by the MessageWindow immediately (see
the first sketch after this list). An inconsistent cache can then be
detected and refreshed correctly.
3. Overwhelming messages are compressed by the MessagePipe in its
async mode (see the second sketch after this list). There is no need
to send all the messages one by one over the MQ; they can be merged
before being sent to the schedulers.
4. When a new service comes up or recovers, it sends notifications to
all known remotes for quick cache synchronization, even before its
service record is available in the db. And if a remote service is
unexpectedly down according to the service group records, no more
messages are sent to it. The ComputeFilter is also removed because of
this feature; the scheduler can detect remote compute nodes by itself.
5. In fact, claims are tracked not only from the schedulers to the
compute nodes, but also from the compute-node host state to the
resource tracker. One reason is that there is still a gap between a
claim being acknowledged by the compute-node host state and the claim
succeeding in the resource tracker; it is necessary to track those
unhandled claims to keep the host state accurate. The second reason is
to separate the schedulers from the compute node and resource tracker.
The scheduler only exports the limited interfaces
`update_from_compute` and `handle_rt_claim_failure` to the compute
service and the RT, so testing and reuse are easier with clear
boundaries.
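As a rough illustration of the seed-based detection in point 2 above
(a sketch only; the prototype's MessageWindow may handle out-of-order
windows rather than the strict ordering shown here), the receiver can
keep the next seed it expects and treat any gap as a signal to refresh
the cache:

# Sketch of seed-based update tracking (point 2). Illustrative only.

class MessageWindow(object):
    def __init__(self):
        self.expected = 0

    def check(self, seed):
        # Classify an incoming update by its seed.
        if seed == self.expected:
            self.expected += 1
            return "apply"    # in order: apply the incremental update
        if seed < self.expected:
            return "ignore"   # duplicate: already applied, drop it
        return "refresh"      # gap: a message was lost or reordered

window = MessageWindow()
for seed in (0, 1, 1, 3):     # seed 1 duplicated, seed 2 lost
    print(seed, window.check(seed))
# -> apply, apply, ignore, refresh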
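And for point 3, a minimal sketch of the merging idea behind the
MessagePipe (again illustrative; the real implementation works
asynchronously on top of the messaging layer):

# Sketch of update merging (point 3). Pending deltas for the same host
# are coalesced and flushed as one message instead of one cast per
# change. Illustrative only; the async MessagePipe differs in detail.

class MessagePipe(object):
    def __init__(self, send):
        self.send = send      # callable that actually casts to schedulers
        self.pending = {}

    def push(self, host, **deltas):
        # Merge the new delta into whatever is queued for this host.
        merged = self.pending.setdefault(host, {})
        for key, value in deltas.items():
            merged[key] = merged.get(key, 0) + value

    def flush(self):
        # One cast per host instead of one cast per resource change.
        for host, deltas in self.pending.items():
            self.send(host, deltas)
        self.pending.clear()

pipe = MessagePipe(send=lambda host, deltas: print(host, deltas))
pipe.push("node1", free_ram_mb=-1024)
pipe.push("node1", free_ram_mb=-2048, vcpus_used=1)
pipe.flush()   # node1 {'free_ram_mb': -3072, 'vcpus_used': 1}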
TODOs:
There are still many features to be implemented; the most important
are unit tests and incremental updates to PCI and NUMA resources, all
of which are marked inline.
References:
[1]
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<
The original commit history of this prototype is located at
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions on how to install and test this prototype, please
refer to the commit message of https://review.openstack.org/#/c/280047/
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev