[1]
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
[2] https://review.openstack.org/#/c/271823/
Regards,
-Yingxin
*From:* Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 9:48 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
the "shared state scheduler"
On 15/02/2016 10:48, Cheng, Yingxin wrote:
Thanks Sylvain,
1. The ideas below will be extended into a spec ASAP.
Nice, looking forward to it then :-)
2. Thanks for raising concerns I had not thought of yet; they will be
addressed in the spec soon.
3. Let me copy my thoughts from another thread about the integration
with resource-provider:
The idea is that “only the compute node knows its own final
compute-node resource view”, or put differently, “the accurate
resource view only exists at the place where the resources are
actually consumed.” That is, the incremental updates can only come
from the actual “consumption” action, no matter where it happens
(e.g. compute node, storage service, network service, etc.). To
borrow the resource-provider terms, a compute node can maintain an
accurate “compute-node-inventory” cache and send incremental updates
because it is the one actually consuming compute resources; likewise,
a storage service can maintain an accurate “storage-inventory” cache
and send incremental updates if it is the one consuming storage
resources. If there are central services in charge of consuming all
the resources, then the accurate cache and its updates must come from
them.
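For illustration, here is a minimal Python sketch of that idea; the
class and field names are made up for this example and are not taken
from the prototype:

```python
class ComputeNodeInventory(object):
    """Authoritative per-node resource view, owned by the compute service."""

    def __init__(self, total_vcpus, total_ram_mb):
        self.free_vcpus = total_vcpus
        self.free_ram_mb = total_ram_mb
        self.version = 0  # monotonically increasing, drives incremental updates

    def consume(self, vcpus, ram_mb):
        # Called only by the local resource tracker, i.e. at the place where
        # the resources are actually consumed; returns the incremental update
        # that would be sent to the schedulers.
        self.free_vcpus -= vcpus
        self.free_ram_mb -= ram_mb
        self.version += 1
        return {'version': self.version,
                'delta': {'vcpus': -vcpus, 'ram_mb': -ram_mb}}


inventory = ComputeNodeInventory(total_vcpus=16, total_ram_mb=32768)
update = inventory.consume(vcpus=4, ram_mb=8192)
# 'update' is what gets cast to the schedulers, instead of having them
# re-read the whole compute node record from the db.
```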
That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain
Regards,
-Yingxin
*From:* Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"
On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,
I’ve uploaded a prototype, https://review.openstack.org/#/c/280047/,
to demonstrate its design goals of improved accuracy, performance,
reliability and compatibility. It will also be an Austin Summit
session if elected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
I want to gather opinions about this idea:
1. Is it possible for this feature to be accepted in the Newton
release?
Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
Ideally, I'd like to see the ideas you outline below written in that
spec file, as that would be the best way to discuss the design.
2. Suggestions to improve its design and compatibility.
I don't want to go into details here (that's rather the goal of the
spec), but my biggest concerns when reviewing the spec would be:
- how this can meet the OpenStack mission statement (i.e. a
ubiquitous solution that would be easy to install and massively
scalable)
- how this can be integrated with the existing pieces (filters,
weighers) to provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending
updates to new schedulers)
- how we can test it
- whether the feature can be made optional for operators
3. Possibilities to integrate with the resource-provider bp series:
I know resource-provider is the major direction of the Nova
scheduler, and there will be fundamental changes in the future,
especially according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible
way to make scheduling decisions based on scheduler caches. The
in-memory decisions are made at the same speed as the caching
scheduler, but the caches are kept consistent with the compute nodes
as quickly as possible, without db refreshing.
That's the key point, thanks for noticing our priorities. So, you
know that our resource modeling is drastically subject to change in
Mitaka and Newton. That is the new game, so I'd love to see how you
plan to interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could
share your ideas, because all of you have great ideas for improving a
currently frustrating solution.
-Sylvain
Here is the detailed design of the mentioned prototype:
>>----------------------------
Background:
The host state cache maintained by the host manager is the
scheduler's resource view during scheduling decisions. It is
updated whenever a request is received[1], and all the compute node
records are retrieved from the db every time. There are several
problems with this update model, proven in the experiments[3]:
1. Performance: Scheduler performance is largely affected by the db
access needed to retrieve compute node records. The db block time
of a single request is 355ms on average in a deployment of 3
compute nodes, compared with only 3ms for the in-memory
decision-making. Imagine what happens with 1k nodes, or even 10k
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem,
but also a problem when using only one scheduler. A detailed
analysis of the one-scheduler problem is in the bug analysis[2]. In
short, there is a gap between the moment the scheduler makes a
decision against its host state cache and the moment the compute
node updates its in-db resource record according to that decision
in the resource tracker. Because of this gap, a recent resource
consumption in the scheduler cache can be lost and overwritten by
compute node data, resulting in cache inconsistency and unexpected
retries (illustrated by the sketch after this list). In a
one-scheduler experiment using a 3-node deployment, 7 retries out
of 31 concurrent schedule requests were recorded, resulting in
22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler
leads to an even worse performance result when using parallel
schedulers. In the same experiment with 4 schedulers on separate
machines, the average db block time increases to 697ms per request
and there are 16 retries out of 31 schedule requests, namely 51.6%
extra overhead.
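To make the “false overwrite” concrete, here is a tiny
self-contained sketch (simplified pseudo-Nova, not the actual code
paths) showing how a full cache refresh can erase a consumption the
scheduler has just made in memory:

```python
class HostState(object):
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def consume(self, ram_mb):
        self.free_ram_mb -= ram_mb          # in-memory deduction only

    def update_from_db(self, db_free_ram_mb):
        self.free_ram_mb = db_free_ram_mb   # full overwrite on refresh


db_free_ram_mb = 4096                       # what the db still reports
host = HostState(db_free_ram_mb)

host.consume(2048)                          # scheduler places a 2 GB instance
host.update_from_db(db_free_ram_mb)         # the next request refreshes from
                                            # the db before the compute node
                                            # has written its claim back

print(host.free_ram_mb)                     # 4096: the consumption is lost, so
                                            # the host looks emptier than it is
                                            # and a later request ends in a retry
```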
Improvements:
This prototype solves the issues mentioned above by implementing a
new update model for the scheduler host state cache. Instead of
refreshing caches from the db, every compute node maintains its own
accurate version of the host state cache, updated by the resource
tracker, and sends incremental updates directly to the schedulers.
So the scheduler caches are synchronized to the correct state as
soon as possible, with the lowest overhead. Also, the scheduler
sends a resource claim together with its decision to the target
compute node. The compute node can decide immediately whether the
claim is successful, based on its local host state cache, and send
a response back ASAP (a rough sketch of this flow follows the
benefit list below). With all the claims tracked from schedulers to
compute nodes, no false overwrites can happen, and thus the gaps
between the scheduler caches and the real compute node states are
minimized. The benefits are obvious in the recorded experiments[3],
compared with the caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the
average decision time per request is about 3ms in both the single-
and multiple-scheduler scenarios, which is equal to the in-memory
decision time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite"
is eliminated, there should be 0 retries in a one-scheduler
deployment, as proven in the experiment. Thanks to the quick claim
responding implementation, there are only 2 retries out of 31
requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because
the data structure of HostState is unchanged. In fact, this
prototype even supports the filter scheduler running at the same
time (already tested). Other operations that change resources, such
as migration, resizing or shelving, make claims in the resource
tracker directly and update the compute node host state
immediately, without major changes.
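Here is a rough, self-contained sketch of the claim flow described
above; the class and method names are made up for this example and
are not the prototype's actual code:

```python
class HostCache(object):
    """Resource view of one compute node. The compute node owns the
    authoritative copy; each scheduler keeps a synchronized replica."""

    def __init__(self, free_vcpus, free_ram_mb):
        self.free_vcpus = free_vcpus
        self.free_ram_mb = free_ram_mb

    def apply_update(self, delta):
        # Incremental update cast by the compute node; no db refresh involved.
        self.free_vcpus += delta['vcpus']
        self.free_ram_mb += delta['ram_mb']

    def try_claim(self, vcpus, ram_mb):
        # Deduct resources if they fit. The scheduler uses this for tentative
        # consumption; the compute node uses it for authoritative acceptance.
        if self.free_vcpus < vcpus or self.free_ram_mb < ram_mb:
            return False
        self.free_vcpus -= vcpus
        self.free_ram_mb -= ram_mb
        return True


# The compute node's authoritative view and a scheduler replica start in sync.
node_view = HostCache(free_vcpus=8, free_ram_mb=16384)
scheduler_view = HostCache(free_vcpus=8, free_ram_mb=16384)

# The scheduler decides in memory and casts the claim to the chosen node...
claim = {'vcpus': 2, 'ram_mb': 4096}
scheduler_view.try_claim(**claim)

# ...and the compute node accepts or rejects it against its own cache and
# responds right away; on rejection the scheduler refreshes its cache and
# retries, and a later resource tracker failure would be reported back via
# something like the handle_rt_claim_failure interface mentioned below.
accepted = node_view.try_claim(**claim)
print(accepted)  # True
```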
Extra features:
More effort was made to adjust the implementation to real-world
scenarios, such as network issues, services going down unexpectedly,
and overwhelming message volumes:
1. The communication between schedulers and compute nodes consists
only of casts; there are no RPC calls, and thus no blocking during
scheduling.
2. All updates from nodes to schedulers are labelled with an
incremental seed, so any message reordering, loss or duplication due
to network issues can be detected by MessageWindow immediately (a
toy example of the idea follows this list). The inconsistent cache
can then be detected and refreshed correctly.
3. Overwhelming messages are compressed by MessagePipe in its async
mode. There is no need to send all the messages one by one through
the MQ; they can be merged before being sent to the schedulers.
4. When a new service comes up or recovers, it sends notifications
to all known remotes for quick cache synchronization, even before
its service record is available in the db. And if a remote service
is unexpectedly down according to the service group records, no
more messages will be sent to it. The ComputeFilter is also removed
because of this feature; the scheduler can detect remote compute
nodes by itself.
5. In fact, the claim tracking is not only from schedulers to
compute nodes, but also from the compute-node host state to the
resource tracker. One reason is that there is still a gap between
the moment a claim is acknowledged by the compute-node host state
and the moment the claim succeeds in the resource tracker; it is
necessary to track those unhandled claims to keep the host state
accurate. The second reason is to decouple the schedulers from the
compute nodes and resource trackers. The scheduler only exports the
limited interfaces `update_from_compute` and
`handle_rt_claim_failure` to the compute service and the RT, so
testing and reuse are easier with clear boundaries.
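As promised above, here is a toy version of the seed checking idea
from feature 2; the semantics are assumed for illustration and this
is not the prototype's actual MessageWindow code:

```python
class MessageWindow(object):
    """Toy sequence tracker for incremental updates from one compute node."""

    def __init__(self):
        self.expected_seed = None

    def check(self, seed):
        # Return 'apply' for the next in-order update, 'drop' for a duplicate
        # or stale one, and 'resync' when a gap shows that at least one update
        # was lost and the whole cache must be refreshed from the node.
        if self.expected_seed is None or seed == self.expected_seed:
            self.expected_seed = seed + 1
            return 'apply'
        if seed < self.expected_seed:
            return 'drop'
        return 'resync'


window = MessageWindow()
assert window.check(0) == 'apply'
assert window.check(1) == 'apply'
assert window.check(1) == 'drop'     # duplicate delivery
assert window.check(3) == 'resync'   # seed 2 was lost, trigger a full refresh
```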
TODOs:
There are still many features to be implemented; the most important
are unit tests and incremental updates for PCI and NUMA resources.
All of them are marked out inline.
References:
[1]
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<
The original commit history of this prototype is located in
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions to install and test this prototype, please
refer to the commit message of
https://review.openstack.org/#/c/280047/
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev