On 02/03/2016 09:15, Cheng, Yingxin wrote:
On Tuesday, March 1, 2016 7:29 PM, John Garbutt <mailto:j...@johngarbutt.com> wrote:
On 1 March 2016 at 08:34, Cheng, Yingxin <yingxin.ch...@intel.com> wrote:
Hi,
I have simulated the distributed resource management with the incremental
update model based on Jay's benchmarking framework:
https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration.
The complete result is at http://paste.openstack.org/show/488677/. It was run
on a VM with 4 cores and 4GB RAM, and the mysql service uses the default
settings except for "innodb_buffer_pool_size", which is set to "2G". The number
of simulated compute nodes is set to 300.
[...]
Second, here's what I've found in the centralized db claim design (i.e. rows
where "do claim in compute?" = No):
1. The speed of legacy python filtering is not slow (see rows where
"Filter strategy" = python): "Placement total query time" records the
cost of all query time, including fetching states from the db and filtering
using python. The actual cost of python filtering is
(Placement_total_query_time - Placement_total_db_query_time), and
that's only about 1/4 of the total cost or even less (a small helper after
this list makes the split explicit). It also means python in-memory
filtering is much faster than db filtering in this
experiment. See http://paste.openstack.org/show/488710/
2. The speed of the `db filter strategy` and the legacy `python filter
strategy` are of the same order of magnitude, so it is not a huge
improvement. See the comparison of the column "Placement total query
time". Note that the extra cost of the `python filter strategy` mainly
comes from "Placement total db query time" (i.e. fetching states from
the db). See http://paste.openstack.org/show/488709/
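To make the arithmetic in point 1 concrete, here is a trivial helper (purely
illustrative; the column names follow the pastes above, nothing here comes
from the benchmark code itself):

# Illustrative only: split one benchmark row's "Placement total query time"
# into its db part and its in-memory python filtering part.
def split_query_time(total_query_time, total_db_query_time):
    """Return (db_cost, python_filter_cost) for one benchmark row."""
    python_filter_cost = total_query_time - total_db_query_time
    return total_db_query_time, python_filter_cost

# A hypothetical row where python filtering is ~1/4 of the total cost:
# split_query_time(4.0, 3.0) == (3.0, 1.0)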
I think it might be time to run this in a nova-scheduler-like
environment: eventlet threads responding to rabbit, using the pymysql backend, etc.
Note we should get quite a bit of concurrency within a single nova-scheduler
process with the db approach.
I suspect clouds that are largely full of pets, pack/fill first, with a smaller
percentage of cattle on top, will benefit the most, as that initial DB filter
will bring back a small list of hosts.
Third, my major concern with the "centralized db claim" design is that it puts
too much scheduling work into the centralized db, and that is not scalable by
simply adding conductors and schedulers.
1. The filtering work is mostly done inside the db by executing complex SQL. If
the filtering logic becomes much more complex (currently only CPU and RAM are
accounted for in the experiment), the db overhead will be considerable.
So, to clarify, only resources we have claims for in the DB will be filtered in
the DB. All other filters will still occur in python.
The good news is that if that turns out to be the wrong trade-off, it's easy to
revert to doing all the filtering in python, with zero impact on the DB
schema.
Another point is that the db filtering recalculates every resource to get
its free value from inventories and allocations each time a schedule
request comes in. This overhead is unnecessary if the scheduler can accept
incremental updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of
the scheduler caches to make sure those updates are handled correctly.
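To illustrate what I mean by strict version control of the cache, a minimal
sketch only (the field names and version scheme are made up, this is not the
proposed implementation):

class NodeCacheEntry(object):
    """Per-compute-node view kept by a scheduler."""
    def __init__(self, free_ram_mb, free_vcpus, version):
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus
        self.version = version          # version of the last applied update

    def apply(self, update):
        """Apply one incremental update; return False if a full refresh is needed."""
        if update.version <= self.version:
            return True                 # duplicate or stale update: ignore it
        if update.version != self.version + 1:
            return False                # gap detected: ask the node for its full state
        self.free_ram_mb += update.ram_mb_delta
        self.free_vcpus += update.vcpus_delta
        self.version = update.version
        return True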
2. Races between centralized claims are resolved by rolling back transactions
and by checking generations (see the design of the "compare and update"
strategy in https://review.openstack.org/#/c/283253/); this also adds overhead
to the db (a sketch of this pattern follows the exchange below).
It's worth noting this pattern is designed to work well with a Galera DB cluster,
including one that has writes going to all the nodes.
I know; my point is that "distributed resource management" with resource
trackers doesn't need db locks or db rollbacks for compute-local resources,
nor the additional overhead, regardless of the type of database.
Well, the transactional "compare-and-update" doesn't need to be done on
the scheduler side, but it will still be needed if we leave the compute
nodes updating their own resources.
To be clear, the fact that the scheduler doesn't need it doesn't mean the
DB models shouldn't have some "compare-and-update" strategy. But I got
your point :-)
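For readers following along, a minimal sketch of the generation-based
"compare and update" pattern point 2 refers to (table and column names are
illustrative, not the actual resource-providers schema from the review above):

import sqlalchemy as sa

_BUMP_GENERATION = sa.text("""
    UPDATE resource_providers
       SET generation = :new_gen
     WHERE id = :rp_id AND generation = :old_gen
""")

class ConcurrentUpdateError(Exception):
    pass

def claim(engine, rp_id, old_gen, write_allocations):
    """One optimistic claim attempt; the caller re-reads and retries on failure."""
    with engine.begin() as conn:            # single transaction
        write_allocations(conn)             # INSERT the allocation rows
        res = conn.execute(_BUMP_GENERATION,
                           {"rp_id": rp_id,
                            "old_gen": old_gen,
                            "new_gen": old_gen + 1})
        if res.rowcount == 0:
            # The generation moved underneath us: raising here rolls the
            # whole transaction (including the allocations) back.
            raise ConcurrentUpdateError()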
3. The db overhead of the filtering operations can be reduced by moving
them to the schedulers; that is 38 times faster and can be executed
in parallel by the schedulers, according to the column "Placement avg query
time". See http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relaxed by using
distributed resource claims in resource trackers. There is no need to roll back
transactions in updating inventories of compute local resources in order to be
accurate. It is confirmed by checking the db records at the end of each run of
eventually consistent scheduler state design.
5. If a) all the filtering operations are done inside the schedulers,
b) the schedulers do not need to refresh their caches from the db because
of incremental updates, and
c) there is no need to do "compare and update" for compute-local
resources (i.e. non-shared resources),
then here is the performance comparison using 1 scheduler
instance: http://paste.openstack.org/show/488717/
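A minimal sketch of the kind of compute-local claim point 4 describes (names
are made up; a real resource tracker keeps much more state):

import threading

class LocalResourceTracker(object):
    """Claims against compute-local resources need only a process-local lock,
    not a db transaction that may have to be rolled back."""
    def __init__(self, free_ram_mb, free_vcpus):
        self._lock = threading.Lock()
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus

    def claim(self, ram_mb, vcpus):
        with self._lock:
            if ram_mb > self.free_ram_mb or vcpus > self.free_vcpus:
                return False    # claim fails; the scheduler retries elsewhere
            self.free_ram_mb -= ram_mb
            self.free_vcpus -= vcpus
            return True         # success; send an incremental update to the schedulers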
The other side of the coin here is sharding.
For example, we could have a dedicated DB cluster for just the scheduler data
(need to add code to support that, but should be possible now, I believe).
Consider if you have three types of hosts that map directly to specific
flavors.
You can shard your scheduler and DB clusters into those groups (i.e. compute
node inventory lives only in one of the shards). When a request comes in, you
just route the appropriate build requests to each of the scheduler clusters.
If you have a large enough deployment, you can shard your hosts across several
DB clusters, and use a modulo or random sharding strategy to pick which cluster
the request lands on. There are issues around ensuring you do capacity planning
that takes those shards into account, but aligning the shards with Availability
Zones, or similar, might stop that being an additional burden.
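To make the modulo idea concrete, a toy example only (the cluster URLs and the
hash key are assumptions, not an existing nova option):

import zlib

DB_CLUSTERS = ["mysql://db0/nova", "mysql://db1/nova", "mysql://db2/nova"]

def cluster_for(compute_node_name):
    """Pin each compute node's inventory to exactly one DB cluster."""
    h = zlib.crc32(compute_node_name.encode("utf-8")) & 0xffffffff
    return DB_CLUSTERS[h % len(DB_CLUSTERS)]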
I think both designs can support sharding. If I understand it correctly, it means the
scheduler will use the "modulo" partition strategy as in the experiment.
As you can see, there is no performance or accuracy regression in the
emulation of my design.
Finally, it is not fair to directly compare the actual abilities of the
resource-provider scheduler and the shared-state scheduler using this
benchmarking tool, because 300 more processes need to be created in order to
simulate the distributed resource management of 300 compute nodes, and there
are no conductors and no MQ in the simulation. But I think it is still useful
to provide at least some statistics.
Really looking forward to a simulator that can test them all in a slightly more
real way (maybe using a fake virt driver, with everything else real?).
I've made good progress on that, almost exactly what you are expecting,
so wait and see :-)
Before drawing any conclusion or going one way or another, I'd love to
see, like John said, a fake implementation using the oslo.messaging fake
driver (so we could simulate the MQ protocol overhead). I'm not sure a fake
virt driver is necessary, since the RT doesn't really use it (just the
figures it gives), so compute workers sending resource consumption
over the fake wire to some scheduler workers should be sufficient to
compare and get better accuracy.
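For illustration, a minimal sketch of such a harness, assuming oslo.messaging's
fake:// transport URL (the topic, method and payload names are made up):

import eventlet
eventlet.monkey_patch()

from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(cfg.CONF, url='fake://')

class SchedulerEndpoint(object):
    """Receives resource consumption updates from simulated compute workers."""
    def update_resource(self, ctxt, node, free_ram_mb):
        print('%s reports %d MB free' % (node, free_ram_mb))

server = oslo_messaging.get_rpc_server(
    transport,
    oslo_messaging.Target(topic='scheduler', server='sched-1'),
    [SchedulerEndpoint()],
    executor='eventlet')
server.start()

# One simulated compute worker casting an update over the fake wire:
client = oslo_messaging.RPCClient(
    transport, oslo_messaging.Target(topic='scheduler'))
client.cast({}, 'update_resource', node='compute-1', free_ram_mb=2048)

eventlet.sleep(0.5)   # give the server a chance to consume the cast
server.stop()
server.wait()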
Thoughts on that?
-Sylvain
Thanks,
John
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev