On 24/02/2016 01:10, Jay Pipes wrote:
On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
I won't argue against performance here. You made a very nice PoC for
testing scaling DB writes within a single Python process and I trust
your findings. While I would naturally prefer a shared-nothing
approach that can horizontally scale, one could argue that we can do
the same with Galera clusters.

a) My benchmarks aren't single process comparisons. They are multi-process benchmarks.

Sorry, I was unclear. I meant that I trust your benchmarks of the read/write performance of the models you proposed. By 'single process' I only meant that you provide a Python script for that, but sure, it can be run as multiple processes. Again, no doubt about your findings. TBH, I literally had no time to resurrect my testbed for this, but I'll play with your scripts as soon as I have time; it's one of my priorities.



b) The approach I've taken is indeed shared-nothing. The scheduler processes do not share any data whatsoever.

c) Galera isn't horizontally scalable. Never was, never will be. That isn't its strong suit. Galera is best for having a synchronously-replicated database cluster that is incredibly easy to manage and administer, but it isn't a performance panacea. Its focus is on availability, not performance :)


Fair and valid point. I mentioned Galera as an example of how operators could try to horizontally scale the DBMS if they need to, vs. the existing model of scaling the number of conductors if we keep resource ownership on the compute nodes.

That said, most operators run a controller/compute setup
where all the services except the compute nodes are hosted on 1:N hosts.
Implementing the resource-providers-scheduler BP (and only that one)
will dramatically increase the number of writes we do in the scheduler
process (i.e. on the "controller", quoted because there is no notion of
a "controller" in Nova; it's just a deployment choice).

Yup, no doubt about it. It won't increase the *total* number of writes the system makes, just the concentration of those writes into the scheduler processes. You are trading increased writes in the scheduler for the challenges inherent in keeping a large distributed cache system valid and fresh (which itself introduces a different kind of write).


No, it will increase the number of writes, because you're planning to write every time a request comes in, right? If so, it scales per request, compared to the existing model where computes write their data every 60 seconds, so it scales per compute.

Since the number of requests is at least one order of magnitude higher than the number of computes, my gut feeling is that it will write to the DB far more often.
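Purely to illustrate the two scaling axes (the numbers below are made up, they are not from Jay's benchmarks):

    # Hypothetical numbers, only to show which variable each model scales with.
    computes = 1000                 # compute nodes in the deployment
    report_interval_s = 60          # legacy: each compute writes its usage every 60s
    boot_requests_per_min = 500     # resource-providers: one claim write per request

    legacy_writes_per_min = computes * 60 / report_interval_s   # grows with computes
    claim_writes_per_min = boot_requests_per_min                # grows with requests

    print(legacy_writes_per_min, claim_writes_per_min)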


That's a big game changer for operators who are currently capping their
capacity by adding more conductors. It would require them to do some DB
modifications to be able to scale their capacity. I'm not against that,
I just say it's a big thing that we need to consider and properly
communicate if agreed.

Agreed completely. I will say, however, that on a 1600 compute node simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 database with 128MB InnoDB buffer pool size barely breaks a sweat on my local machine.


Nice, that makes me very comfortable with your approach from a performance PoV :-)
Great work btw.

> It can be alleviated by changing to a stand-alone high-performance database.

It doesn't need to be high-performance at all. In my benchmarks, a
small-sized stock MySQL database server is able to fulfill thousands
of placement queries and claim transactions per minute using
completely isolated non-shared, non-caching scheduler processes.

> And the cache refreshing is designed to be replaced by direct SQL queries according to the resource-provider scheduler spec [2].

Yes, this is correct.

> The performance bottleneck of the shared-state scheduler may come from the overwhelming number of update messages; it can also be alleviated by changing to a stand-alone distributed message queue and by using the “MessagePipe” to merge messages.

In terms of the number of messages used in each design, I see the
following relationship:

resource-providers < legacy < shared-state-scheduler

would you agree with that?

True. But that's manageable by adding more conductors, right? IMHO,
Nova performance is bound by the number of conductors you run, and I like
that, because it makes it easy to increase capacity.
Also, the payload could be far smaller than it is today: instead of
sending a full update for a single compute_node entry, it would only
send the diff (plus periodic full syncs). We would then mitigate
the increase in the number of messages by making sure we send less per message.
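As a rough sketch of what such a diff-based payload could look like (hypothetical field names, nothing from either spec):

    def usage_diff(previous, current):
        """Return only the resource fields that changed since the last update.

        Both arguments are plain dicts of a compute node's usage, e.g.
        {'vcpus_used': 12, 'memory_mb_used': 40960, 'local_gb_used': 800}.
        """
        return {key: value for key, value in current.items()
                if previous.get(key) != value}

    # A compute node would send usage_diff(last_sent, now) on each change,
    # plus the full dict as a periodic re-sync to recover from lost messages.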

Sending no message at all is better than sending any message, regardless of whether that message contains an incremental update or a full object.

Well, that's a tautology :-)
I'm just trying to explain that an "eventually-consistent" scheduler remains manageable because it keeps the same scaling factor we currently have (i.e. the number of conductors).

The resource-providers proposal actually uses no update messages at
all (except in the abnormal case of a compute node failing to start
the resources that had previously been claimed by the scheduler). All
updates are done in a single database transaction when the claim is made.
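For illustration only, a minimal sketch of what such a single-transaction claim could look like (the table and column names are hypothetical, not taken from the resource-providers specs):

    import sqlalchemy as sa

    def claim(engine, provider_id, instance_id, vcpus, ram_mb):
        """Atomically record a claim against one resource provider."""
        with engine.begin() as conn:        # one transaction: commit or rollback
            # The UPDATE only matches if enough capacity is left, so the
            # capacity check and the write happen in the same statement.
            result = conn.execute(sa.text(
                "UPDATE inventories "
                "SET vcpus_used = vcpus_used + :vcpus, "
                "    ram_mb_used = ram_mb_used + :ram_mb "
                "WHERE provider_id = :pid "
                "  AND vcpus_used + :vcpus <= vcpus_total "
                "  AND ram_mb_used + :ram_mb <= ram_mb_total"),
                {"vcpus": vcpus, "ram_mb": ram_mb, "pid": provider_id})
            if result.rowcount == 0:
                return False                # lost the race; the caller retries
            conn.execute(sa.text(
                "INSERT INTO allocations (instance_id, provider_id, vcpus, ram_mb) "
                "VALUES (:iid, :pid, :vcpus, :ram_mb)"),
                {"iid": instance_id, "pid": provider_id,
                 "vcpus": vcpus, "ram_mb": ram_mb})
            return True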

See, I don't think that a compute node being unable to honor a request is an
'abnormal case'. There are many reasons why a request can't be honored
by the compute node:
  - for example, the scheduler doesn't own all the compute resources and
can thus miss some information: say you want to pin a specific pCPU and
that pCPU is already assigned. The scheduler doesn't know *which* pCPUs
are free, it only knows *how many* are free (see the small sketch below).
That atomic transaction (pick a free pCPU and assign it to the instance)
is made on the compute manager, not at the exact same time we decrease
the resource usage for pCPUs (because that part would be done in the
scheduler process).
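Purely for illustration, a hypothetical compute-side helper (this is not Nova code): the scheduler only ever tracked the count, while the choice of a specific pCPU is made on the host under its own lock:

    import threading

    _pin_lock = threading.Lock()
    _free_pcpus = {2, 3, 6, 7}      # hypothetical: pCPUs still unpinned on this host

    def pin_one_pcpu(instance_id):
        """Pick *which* pCPU to pin, atomically on the compute host."""
        with _pin_lock:
            if not _free_pcpus:
                raise RuntimeError("no free pCPU left, reschedule needed")
            pcpu = _free_pcpus.pop()
        # ... record the pinning for instance_id in its domain definition ...
        return pcpu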

See my response to Chris Friesen about the above.

  - some "reserved" RAM or disk could be underestimated and,
consequently, spawning a VM could either take far more time than
planned (which would mean a suboptimal placement) or fail, which
would trigger a reschedule.

Again, the above is an abnormal case.

<snip>

It's a distributed problem. Neither the shared-state scheduler can have
an accurate view (because it only guarantees that the view will
*eventually* be consistent), nor can the scheduler process (because the
scheduler doesn't own the resource usage made by the compute nodes).

If the scheduler claimed the resources on the compute node, it *would* own those resources, though, and therefore it *would* have a 99.9% accurate view of the inventory of the system.

It's not a distributed problem if you don't make it a distributed problem.

There's nothing wrong with the compute node having the final "say" about whether some claimed-by-the-scheduler resources truly cannot be provided by the compute node, but again, that is a) an abnormal operating condition and b) would simply trigger a notification to the scheduler to free said resources.


So, I get your point. I just need to think more about whether that's an abnormal and thus exceptional case, or whether it's part of the design, because that's where the truth lies. I don't know yet; I just mean that this can't be easily dismissed by saying "meh, it's not significant".

4. Design goal difference:

The fundamental design goal of the two new schedulers is different. Copying
my views from [2], I think it is the choice between "loose
distributed consistency with retries" and "strict centralized
consistency with locks".

There are a couple of other things that I believe we should be
documenting, considering and measuring with regard to scheduler designs:

a) Debuggability

The ability of a system to be debugged and of requests to that system
to be diagnosed is a critical part of the value of a particular
system design. I'm hoping that by removing a lot of the moving parts
from the legacy filter scheduler design (removing the caching,
removing the Python-side filtering and weighing, removing the interval
during which placement decisions can conflict, removing the cost and
frequency of retry operations), the resource-provider scheduler
design will become simpler for operators to use.

Keep in mind that scheduler filters are a very handy pluggable system
for operators wanting to implement their own placement logic. If you
want to deprecate filters (and I'm not opinionated on that), just make
sure that you keep that extension capability.

Understood, and I will do my best to provide some level of extensibility.


I appreciate that. Something still unclear in my mind: our extension model runs per host, not as a full clause when getting the list of compute nodes. Knowing what will be part of the initial filtering clause and what will be kept as filters would be greatly appreciated, I guess.

b) Simplicity

Goes to the above point about debuggability, but I've always tried to
follow the mantra that the best software design is not when you've
added the last piece to it, but rather when you've removed the last
piece from it and still have a functioning and performant system.
Having a scheduler that can tackle the process of tracking resources,
deciding on placement, and claiming those resources instead of playing
an intricate dance of keeping state caches valid will, IMHO, lead to a
better scheduler.

I'm also very concerned about the upgrade path for both proposals, as I
said before. I haven't yet seen how either of those two proposals
changes the existing FilterScheduler and what is subject to change.
Also, please keep in mind that we support rolling upgrades, so we need
to support old compute nodes.

Nothing about any of the resource-providers specs inhibits rolling upgrades. In fact, two of the specs (resource-providers-allocations and compute-node-inventory) are *only* about how to do online data migration from the legacy DB schema to the resource-providers schema. :)


Sorry, my concern about rolling upgrades was not aimed at your proposal, but rather at the 'eventually-consistent' BP. My first sentence about what will change in the FilterScheduler is somewhat related to my previous comment, where I'm unclear about what will change in the scheduler and what won't.

From my understanding, your proposal is to refine the existing dummy ComputeNodeList.get_all() call we make in the HostManager, and instead provide an object call which would only return the list of compute nodes that have enough disk, RAM and CPU.
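Purely as an illustration of that idea (hypothetical column names, not from the spec), such a call might boil down to something like:

    import sqlalchemy as sa

    # Hypothetical replacement for the dummy ComputeNodeList.get_all(): only
    # fetch the hosts that still have room for the requested resources.
    _candidates = sa.text(
        "SELECT host FROM compute_nodes "
        "WHERE free_disk_gb >= :disk_gb "
        "  AND free_ram_mb >= :ram_mb "
        "  AND (vcpus - vcpus_used) >= :vcpus")

    def get_candidates(engine, disk_gb, ram_mb, vcpus):
        with engine.connect() as conn:
            rows = conn.execute(_candidates, {"disk_gb": disk_gb,
                                              "ram_mb": ram_mb,
                                              "vcpus": vcpus})
            return [row.host for row in rows]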

In that case, I suppose you plan to deprecate RAMFilter, DiskFilter and CoreFilter. Fair enough, but are you planning to deprecate AggregateRAMFilter, AggregateCoreFilter and other filters like the NUMA ones?


-Sylvain

Best,
-jay

