[1]
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
[2] https://review.openstack.org/#/c/271823/
Regards,
-Yingxin
*From:* Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 9:48 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
the "shared state scheduler"
On 15/02/2016 10:48, Cheng, Yingxin wrote:
Thanks Sylvain,
1. The ideas below will be extended into a spec ASAP.
Nice, looking forward to it then :-)
2. Thanks for raising concerns I had not thought of yet; they will be
addressed in the spec soon.
3. Let me copy my thoughts from another thread about the integration
with resource-provider:
The idea is that “only the compute node knows its own final
compute-node resource view”, or put differently, “the accurate
resource view only exists at the place where the resources are
actually consumed.” That is, the incremental updates can only come
from the actual “consumption” action, no matter where it happens
(e.g. compute node, storage service, network service, etc.). To
borrow the resource-provider terms, a compute node can maintain an
accurate “compute-node-inventory” cache and send incremental updates
because it is the one actually consuming compute resources; likewise,
a storage service can maintain an accurate “storage-inventory” cache
and send incremental updates if it is the one consuming storage
resources. If there are central services in charge of consuming all
the resources, then the accurate cache and its updates must come from
them.
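For illustration, here is a minimal Python sketch of that idea; the
class and field names are made up for this example and are not taken
from the prototype:

```python
class ComputeNodeInventory(object):
    """Authoritative per-node resource view, owned by the compute service."""

    def __init__(self, total_vcpus, total_ram_mb):
        self.free_vcpus = total_vcpus
        self.free_ram_mb = total_ram_mb
        self.version = 0  # monotonically increasing, drives incremental updates

    def consume(self, vcpus, ram_mb):
        # Called only by the local resource tracker, i.e. at the place where
        # the resources are actually consumed; returns the incremental update
        # that would be sent to the schedulers.
        self.free_vcpus -= vcpus
        self.free_ram_mb -= ram_mb
        self.version += 1
        return {'version': self.version,
                'delta': {'vcpus': -vcpus, 'ram_mb': -ram_mb}}


inventory = ComputeNodeInventory(total_vcpus=16, total_ram_mb=32768)
update = inventory.consume(vcpus=4, ram_mb=8192)
# 'update' is what gets cast to the schedulers, instead of having them
# re-read the whole compute node record from the db.
```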
That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain
Regards,
-Yingxin
*From:* Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"
On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,
I’ve uploaded a prototype, https://review.openstack.org/#/c/280047/,
to demonstrate its design goals of improved accuracy, performance,
reliability and compatibility. It will also be an Austin Summit
session if elected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
I want to gather opinions about this idea:
1. Is it possible for this feature to be accepted in the Newton
release?
Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
Ideally, I'd like to see the ideas you outline below written in that
spec file, as that would be the best way to discuss the design.
2. Suggestions to improve its design and compatibility.
I don't want to go into details here (that's rather the goal of the
spec), but my biggest concerns when reviewing the spec would be:
- how this can meet the OpenStack mission statement (i.e. a
ubiquitous solution that would be easy to install and massively
scalable)
- how this can be integrated with the existing pieces (filters,
weighers) to provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending
updates to new schedulers)
- how we can test it
- whether the feature can be made optional for operators
3. Possibilities to integrate with the resource-provider bp series:
I know resource-provider is the major direction of the Nova
scheduler, and there will be fundamental changes in the future,
especially according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible
way to make scheduling decisions based on scheduler caches. The
in-memory decisions are made at the same speed as the caching
scheduler, but the caches are kept consistent with the compute nodes
as quickly as possible, without db refreshing.
That's the key point, thanks for noticing our priorities. So, you
know that our resource modeling is drastically subject to change in
Mitaka and Newton. That is the new game, so I'd love to see how you
plan to interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could
share your ideas, because all of you have great ideas for improving a
currently frustrating solution.
-Sylvain
Here is the detailed design of the mentioned prototype:
>>----------------------------
Background:
The host state cache maintained by the host manager is the
scheduler's resource view during scheduling decisions. It is
updated whenever a request is received[1], and all the compute node
records are retrieved from the db every time. There are several
problems with this update model, proven in the experiments[3]:
1. Performance: Scheduler performance is largely affected by the db
access needed to retrieve compute node records. The db block time
of a single request is 355ms on average in a deployment of 3
compute nodes, compared with only 3ms for the in-memory
decision-making. Imagine what happens with 1k nodes, or even 10k
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem,
but also a problem when using only one scheduler. A detailed
analysis of the one-scheduler problem is in the bug analysis[2]. In
short, there is a gap between the moment the scheduler makes a
decision against its host state cache and the moment the compute
node updates its in-db resource record according to that decision
in the resource tracker. Because of this gap, a recent resource
consumption in the scheduler cache can be lost and overwritten by
compute node data, resulting in cache inconsistency and unexpected
retries (illustrated by the sketch after this list). In a
one-scheduler experiment using a 3-node deployment, 7 retries out
of 31 concurrent schedule requests were recorded, resulting in
22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler
leads to an even worse performance result when using parallel
schedulers. In the same experiment with 4 schedulers on separate
machines, the average db block time increases to 697ms per request
and there are 16 retries out of 31 schedule requests, namely 51.6%
extra overhead.
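To make the “false overwrite” concrete, here is a tiny
self-contained sketch (simplified pseudo-Nova, not the actual code
paths) showing how a full cache refresh can erase a consumption the
scheduler has just made in memory:

```python
class HostState(object):
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def consume(self, ram_mb):
        self.free_ram_mb -= ram_mb          # in-memory deduction only

    def update_from_db(self, db_free_ram_mb):
        self.free_ram_mb = db_free_ram_mb   # full overwrite on refresh


db_free_ram_mb = 4096                       # what the db still reports
host = HostState(db_free_ram_mb)

host.consume(2048)                          # scheduler places a 2 GB instance
host.update_from_db(db_free_ram_mb)         # the next request refreshes from
                                            # the db before the compute node
                                            # has written its claim back

print(host.free_ram_mb)                     # 4096: the consumption is lost, so
                                            # the host looks emptier than it is
                                            # and a later request ends in a retry
```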
Improvements:
This prototype solves the issues mentioned above by implementing a
new update model for the scheduler host state cache. Instead of
refreshing caches from the db, every compute node maintains its own
accurate version of the host state cache, updated by the resource
tracker, and sends incremental updates directly to the schedulers.
So the scheduler caches are synchronized to the correct state as
soon as possible, with the lowest overhead. Also, the scheduler
sends a resource claim together with its decision to the target
compute node. The compute node can decide immediately whether the
claim is successful, based on its local host state cache, and send
a response back ASAP (a rough sketch of this flow follows the
benefit list below). With all the claims tracked from schedulers to
compute nodes, no false overwrites can happen, and thus the gaps
between the scheduler caches and the real compute node states are
minimized. The benefits are obvious in the recorded experiments[3],
compared with the caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the
average decision time per request is about 3ms in both the single-
and multiple-scheduler scenarios, which is equal to the in-memory
decision time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite"
is eliminated, there should be 0 retries in a one-scheduler
deployment, as proven in the experiment. Thanks to the quick claim
responding implementation, there are only 2 retries out of 31
requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because
the data structure of HostState is unchanged. In fact, this
prototype even supports the filter scheduler running at the same
time (already tested). Other operations that change resources, such
as migration, resizing or shelving, make claims in the resource
tracker directly and update the compute node host state
immediately, without major changes.
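Here is a rough, self-contained sketch of the claim flow described
above; the class and method names are made up for this example and
are not the prototype's actual code:

```python
class HostCache(object):
    """Resource view of one compute node. The compute node owns the
    authoritative copy; each scheduler keeps a synchronized replica."""

    def __init__(self, free_vcpus, free_ram_mb):
        self.free_vcpus = free_vcpus
        self.free_ram_mb = free_ram_mb

    def apply_update(self, delta):
        # Incremental update cast by the compute node; no db refresh involved.
        self.free_vcpus += delta['vcpus']
        self.free_ram_mb += delta['ram_mb']

    def try_claim(self, vcpus, ram_mb):
        # Deduct resources if they fit. The scheduler uses this for tentative
        # consumption; the compute node uses it for authoritative acceptance.
        if self.free_vcpus < vcpus or self.free_ram_mb < ram_mb:
            return False
        self.free_vcpus -= vcpus
        self.free_ram_mb -= ram_mb
        return True


# The compute node's authoritative view and a scheduler replica start in sync.
node_view = HostCache(free_vcpus=8, free_ram_mb=16384)
scheduler_view = HostCache(free_vcpus=8, free_ram_mb=16384)

# The scheduler decides in memory and casts the claim to the chosen node...
claim = {'vcpus': 2, 'ram_mb': 4096}
scheduler_view.try_claim(**claim)

# ...and the compute node accepts or rejects it against its own cache and
# responds right away; on rejection the scheduler refreshes its cache and
# retries, and a later resource tracker failure would be reported back via
# something like the handle_rt_claim_failure interface mentioned below.
accepted = node_view.try_claim(**claim)
print(accepted)  # True
```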
Extra features:
More effort was made to adjust the implementation to real-world
scenarios, such as network issues, services going down unexpectedly,
and overwhelming message volumes:
1. The communication between schedulers and compute nodes consists
only of casts; there are no RPC calls, and thus no blocking during
scheduling.
2. All updates from nodes to schedulers are labelled with an
incremental seed, so any message reordering, loss or duplication due
to network issues can be detected by MessageWindow immediately (a
toy example of the idea follows this list). The inconsistent cache
can then be detected and refreshed correctly.
3. Overwhelming messages are compressed by MessagePipe in its async
mode. There is no need to send all the messages one by one through
the MQ; they can be merged before being sent to the schedulers.
4. When a new service comes up or recovers, it sends notifications
to all known remotes for quick cache synchronization, even before
its service record is available in the db. And if a remote service
is unexpectedly down according to the service group records, no
more messages will be sent to it. The ComputeFilter is also removed
because of this feature; the scheduler can detect remote compute
nodes by itself.
5. In fact, the claim tracking is not only from schedulers to
compute nodes, but also from the compute-node host state to the
resource tracker. One reason is that there is still a gap between
the moment a claim is acknowledged by the compute-node host state
and the moment the claim succeeds in the resource tracker; it is
necessary to track those unhandled claims to keep the host state
accurate. The second reason is to decouple the schedulers from the
compute nodes and resource trackers. The scheduler only exports the
limited interfaces `update_from_compute` and
`handle_rt_claim_failure` to the compute service and the RT, so
testing and reuse are easier with clear boundaries.
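As promised above, here is a toy version of the seed checking idea
from feature 2; the semantics are assumed for illustration and this
is not the prototype's actual MessageWindow code:

```python
class MessageWindow(object):
    """Toy sequence tracker for incremental updates from one compute node."""

    def __init__(self):
        self.expected_seed = None

    def check(self, seed):
        # Return 'apply' for the next in-order update, 'drop' for a duplicate
        # or stale one, and 'resync' when a gap shows that at least one update
        # was lost and the whole cache must be refreshed from the node.
        if self.expected_seed is None or seed == self.expected_seed:
            self.expected_seed = seed + 1
            return 'apply'
        if seed < self.expected_seed:
            return 'drop'
        return 'resync'


window = MessageWindow()
assert window.check(0) == 'apply'
assert window.check(1) == 'apply'
assert window.check(1) == 'drop'     # duplicate delivery
assert window.check(3) == 'resync'   # seed 2 was lost, trigger a full refresh
```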
TODOs:
There are still many features to be implemented; the most important
are unit tests and incremental updates for PCI and NUMA resources.
All of them are marked out inline.
References:
[1]
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<
The original commit history of this prototype is located in
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions to install and test this prototype, please
refer to the commit message of
https://review.openstack.org/#/c/280047/
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev