On 24/02/2016 01:10, Jay Pipes wrote:
On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
I won't argue against performance here. You made a very nice PoC for
testing scaling DB writes within a single Python process and I trust
your findings. While I would naturally prefer a shared-nothing
approach that can horizontally scale, one could argue that we can do
the same with Galera clusters.

a) My benchmarks aren't single process comparisons. They are multi-process benchmarks.

Sorry, I was unclear. I meant that I trust your benchmarks of the read/write performance of the models you proposed. By 'single process' I only meant that you provide a Python script for that, but sure, it can be run as multiple processes. Again, no doubt about your findings. TBH, I literally had no time to resurrect my testbed for this, but I'll play with your scripts as soon as I have time; it's one of my priorities.



b) The approach I've taken is indeed shared-nothing. The scheduler processes do not share any data whatsoever.

c) Galera isn't horizontally scalable. Never was, never will be. That isn't its strong suit. Galera is best for having a synchronously-replicated database cluster that is incredibly easy to manage and administer, but it isn't a performance panacea. Its focus is on availability, not performance :)


Fair and valid point. I mentioned Galera as an example of how operators could try to horizontally scale the DBMS if they need to, vs. the existing model of scaling the number of conductors if we keep resource ownership on the compute nodes.

That said, most operators run a controller/compute setup
where all the services except the compute nodes are hosted on 1:N hosts.
Implementing the resource-providers-scheduler BP (and only that one)
will dramatically increase the number of writes we do in the scheduler
process (i.e. on the "controller", quoted because there is no notion of
a "controller" in Nova; it's just a deployment choice).

Yup, no doubt about it. It won't increase the *total* number of writes the system makes, just the concentration of those writes into the scheduler processes. You are trading increased writes in the scheduler for the challenges inherent in keeping a large distributed cache system valid and fresh (which itself introduces a different kind of write).


No, it will increase the number of writes, because you're planning to write every time a request comes in, right? If so, it scales per request, compared to the existing model where computes write their data every 60 seconds, so it scales per compute.

Since the number of requests is at least one order of magnitude higher than the number of computes, my gut feeling is that it will write to the DB far more often.
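Purely to illustrate the two scaling axes (the numbers below are made up, they are not from Jay's benchmarks):

    # Hypothetical numbers, only to show which variable each model scales with.
    computes = 1000                 # compute nodes in the deployment
    report_interval_s = 60          # legacy: each compute writes its usage every 60s
    boot_requests_per_min = 500     # resource-providers: one claim write per request

    legacy_writes_per_min = computes * 60 / report_interval_s   # grows with computes
    claim_writes_per_min = boot_requests_per_min                # grows with requests

    print(legacy_writes_per_min, claim_writes_per_min)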


That's a big game changer for operators who are currently capping their
capacity by adding more conductors. It would require them to do some DB
modifications to be able to scale their capacity. I'm not against that,
I just say it's a big thing that we need to consider and properly
communicate if agreed.

Agreed completely. I will say, however, that on a 1600 compute node simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 database with 128MB InnoDB buffer pool size barely breaks a sweat on my local machine.


Nice, that makes me very comfortable with your approach from a performance PoV :-)
Great work btw.

> It can be alleviated by changing to a stand-alone high-performance database.

It doesn't need to be high-performance at all. In my benchmarks, a
small-sized stock MySQL database server is able to fulfill thousands
of placement queries and claim transactions per minute using
completely isolated non-shared, non-caching scheduler processes.

> And the cache refreshing is designed to be replaced by direct SQL queries according to the resource-provider scheduler spec [2].

Yes, this is correct.

> The performance bottleneck of the shared-state scheduler may come from the overwhelming number of update messages; it can also be alleviated by changing to a stand-alone distributed message queue and by using the “MessagePipe” to merge messages.

In terms of the number of messages used in each design, I see the
following relationship:

resource-providers < legacy < shared-state-scheduler

would you agree with that?

True. But that's manageable by adding more conductors, right? IMHO,
Nova performance is bound by the number of conductors you run, and I like
that, because it makes it easy to increase capacity.
Also, the payload could be far smaller than it is today: instead of
sending a full update for a single compute_node entry, it would only
send the diff (plus periodic full syncs). We would then mitigate
the increase in the number of messages by making sure we send less per message.
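As a rough sketch of what such a diff-based payload could look like (hypothetical field names, nothing from either spec):

    def usage_diff(previous, current):
        """Return only the resource fields that changed since the last update.

        Both arguments are plain dicts of a compute node's usage, e.g.
        {'vcpus_used': 12, 'memory_mb_used': 40960, 'local_gb_used': 800}.
        """
        return {key: value for key, value in current.items()
                if previous.get(key) != value}

    # A compute node would send usage_diff(last_sent, now) on each change,
    # plus the full dict as a periodic re-sync to recover from lost messages.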

Sending no message at all is better than sending any message, regardless of whether that message contains an incremental update or a full object.

Well, that's a tautology :-)
I'm just trying to explain that an "eventually-consistent" scheduler remains manageable because it keeps the same scaling factor we currently have (i.e. the number of conductors).

The resource-providers proposal actually uses no update messages at
all (except in the abnormal case of a compute node failing to start
the resources that had previously been claimed by the scheduler). All
updates are done in a single database transaction when the claim is made.
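For illustration only, a minimal sketch of what such a single-transaction claim could look like (the table and column names are hypothetical, not taken from the resource-providers specs):

    import sqlalchemy as sa

    def claim(engine, provider_id, instance_id, vcpus, ram_mb):
        """Atomically record a claim against one resource provider."""
        with engine.begin() as conn:        # one transaction: commit or rollback
            # The UPDATE only matches if enough capacity is left, so the
            # capacity check and the write happen in the same statement.
            result = conn.execute(sa.text(
                "UPDATE inventories "
                "SET vcpus_used = vcpus_used + :vcpus, "
                "    ram_mb_used = ram_mb_used + :ram_mb "
                "WHERE provider_id = :pid "
                "  AND vcpus_used + :vcpus <= vcpus_total "
                "  AND ram_mb_used + :ram_mb <= ram_mb_total"),
                {"vcpus": vcpus, "ram_mb": ram_mb, "pid": provider_id})
            if result.rowcount == 0:
                return False                # lost the race; the caller retries
            conn.execute(sa.text(
                "INSERT INTO allocations (instance_id, provider_id, vcpus, ram_mb) "
                "VALUES (:iid, :pid, :vcpus, :ram_mb)"),
                {"iid": instance_id, "pid": provider_id,
                 "vcpus": vcpus, "ram_mb": ram_mb})
            return True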

See, I don't think that a compute node being unable to honor a request is an
'abnormal case'. There are many reasons why a request can't be honored
by the compute node:
  - for example, the scheduler doesn't own all the compute resources and
can thus miss some information: say you want to pin a specific pCPU and
that pCPU is already assigned. The scheduler doesn't know *which* pCPUs
are free, it only knows *how many* are free (see the small sketch below).
That atomic transaction (pick a free pCPU and assign it to the instance)
is made on the compute manager, not at the exact same time we decrease
the resource usage for pCPUs (because that part would be done in the
scheduler process).
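Purely for illustration, a hypothetical compute-side helper (this is not Nova code): the scheduler only ever tracked the count, while the choice of a specific pCPU is made on the host under its own lock:

    import threading

    _pin_lock = threading.Lock()
    _free_pcpus = {2, 3, 6, 7}      # hypothetical: pCPUs still unpinned on this host

    def pin_one_pcpu(instance_id):
        """Pick *which* pCPU to pin, atomically on the compute host."""
        with _pin_lock:
            if not _free_pcpus:
                raise RuntimeError("no free pCPU left, reschedule needed")
            pcpu = _free_pcpus.pop()
        # ... record the pinning for instance_id in its domain definition ...
        return pcpu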

See my response to Chris Friesen about the above.

  - some "reserved" RAM or disk could be underestimated and,
consequently, spawning a VM could either take far more time than
planned (which would mean a suboptimal placement) or fail, which
would trigger a reschedule.

Again, the above is an abnormal case.

<snip>

It's a distributed problem. Neither the shared-state scheduler can have
an accurate view (because it only guarantees that the view will
*eventually* be consistent), nor can the scheduler process (because the
scheduler doesn't own the resource usage made by the compute nodes).

If the scheduler claimed the resources on the compute node, it *would* own those resources, though, and therefore it *would* have a 99.9% accurate view of the inventory of the system.

It's not a distributed problem if you don't make it a distributed problem.

There's nothing wrong with the compute node having the final "say" about whether some claimed-by-the-scheduler resources truly cannot be provided by the compute node, but again, that is a) an abnormal operating condition and b) would simply trigger a notification to the scheduler to free said resources.


So, I get your point. I just need to think more about whether that's an abnormal and thus exceptional case, or whether it's part of the design, because that's where the truth lies. I don't know yet; I just mean that this can't be easily dismissed by saying "meh, it's not significant".

4. Design goal difference:

The fundamental design goal of the two new schedulers is different. Copying
my views from [2], I think it is the choice between "loose
distributed consistency with retries" and "strict centralized
consistency with locks".

There are a couple of other things that I believe we should be
documenting, considering and measuring with regard to scheduler designs:

a) Debuggability

The ability of a system to be debugged and of requests to that system
to be diagnosed is a critical part of the value of a particular
system design. I'm hoping that by removing a lot of the moving parts
from the legacy filter scheduler design (removing the caching,
removing the Python-side filtering and weighing, removing the interval
during which placement decisions can conflict, removing the cost and
frequency of retry operations), the resource-provider scheduler
design will become simpler for operators to use.

Keep in mind that scheduler filters are a very handy pluggable system
for operators wanting to implement their own placement logic. If you
want to deprecate filters (and I'm not opinionated on that), just make
sure that you keep that extension capability.

Understood, and I will do my best to provide some level of extensibility.


I appreciate that. Something still unclear in my mind: our extension model runs per host, not as a full clause when getting the list of compute nodes. Knowing what will be part of the initial filtering clause and what will be kept as filters would be greatly appreciated, I guess.

b) Simplicity

Goes to the above point about debuggability, but I've always tried to
follow the mantra that the best software design is not when you've
added the last piece to it, but rather when you've removed the last
piece from it and still have a functioning and performant system.
Having a scheduler that can tackle the process of tracking resources,
deciding on placement, and claiming those resources instead of playing
an intricate dance of keeping state caches valid will, IMHO, lead to a
better scheduler.

I'm also very concerned about the upgrade path for both proposals, as I
said before. I haven't yet seen how either of those two proposals
changes the existing FilterScheduler and what is subject to change.
Also, please keep in mind that we support rolling upgrades, so we need
to support old compute nodes.

Nothing about any of the resource-providers specs inhibits rolling upgrades. In fact, two of the specs (resource-providers-allocations and compute-node-inventory) are *only* about how to do online data migration from the legacy DB schema to the resource-providers schema. :)


Sorry, my concern about rolling upgrades was not aimed at your proposal, but rather at the 'eventually-consistent' BP. My first sentence about what will change in the FilterScheduler is somewhat related to my previous comment, where I'm unclear about what will change in the scheduler and what won't.

From my understanding, your proposal is to refine the existing dummy ComputeNodeList.get_all() call we make in the HostManager, and instead provide an object call which would only return the list of compute nodes that have enough disk, RAM and CPU.
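Purely as an illustration of that idea (hypothetical column names, not from the spec), such a call might boil down to something like:

    import sqlalchemy as sa

    # Hypothetical replacement for the dummy ComputeNodeList.get_all(): only
    # fetch the hosts that still have room for the requested resources.
    _candidates = sa.text(
        "SELECT host FROM compute_nodes "
        "WHERE free_disk_gb >= :disk_gb "
        "  AND free_ram_mb >= :ram_mb "
        "  AND (vcpus - vcpus_used) >= :vcpus")

    def get_candidates(engine, disk_gb, ram_mb, vcpus):
        with engine.connect() as conn:
            rows = conn.execute(_candidates, {"disk_gb": disk_gb,
                                              "ram_mb": ram_mb,
                                              "vcpus": vcpus})
            return [row.host for row in rows]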

In that case, I suppose you plan to deprecate RAMFilter, DiskFilter and CoreFilter. Fair enough, but are you planning to deprecate AggregateRAMFilter, AggregateCoreFilter and other filters like the NUMA ones?


-Sylvain

Best,
-jay

