On 02/03/2016 09:15, Cheng, Yingxin wrote:
On Tuesday, March 1, 2016 7:29 PM, John Garbutt <mailto:j...@johngarbutt.com> wrote:
On 1 March 2016 at 08:34, Cheng, Yingxin <yingxin.ch...@intel.com> wrote:
Hi,
I have simulated the distributed resource management with the incremental
update model based on Jay's benchmarking framework:
https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration.
The complete result is at http://paste.openstack.org/show/488677/. It was run
on a VM with 4 cores and 4GB RAM, and the mysql service uses the default
settings except for "innodb_buffer_pool_size", which is set to "2G". The number
of simulated compute nodes is set to 300.
[...]
Second, here's what I've found in the centralized db claim design (i.e. rows
where "do claim in compute?" = No):
1. The speed of legacy python filtering is not slow (see rows where
"Filter strategy" = python): "Placement total query time" records the
cost of all query time, including fetching states from the db and filtering
using python. The actual cost of python filtering is
(Placement_total_query_time - Placement_total_db_query_time), and
that's only about 1/4 of the total cost or even less (a small helper after
this list makes the split explicit). It also means python in-memory
filtering is much faster than db filtering in this
experiment. See http://paste.openstack.org/show/488710/
2. The speed of the `db filter strategy` and the legacy `python filter
strategy` are of the same order of magnitude, so it is not a huge
improvement. See the comparison of the column "Placement total query
time". Note that the extra cost of the `python filter strategy` mainly
comes from "Placement total db query time" (i.e. fetching states from
the db). See http://paste.openstack.org/show/488709/
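To make the arithmetic in point 1 concrete, here is a trivial helper (purely
illustrative; the column names follow the pastes above, nothing here comes
from the benchmark code itself):

# Illustrative only: split one benchmark row's "Placement total query time"
# into its db part and its in-memory python filtering part.
def split_query_time(total_query_time, total_db_query_time):
    """Return (db_cost, python_filter_cost) for one benchmark row."""
    python_filter_cost = total_query_time - total_db_query_time
    return total_db_query_time, python_filter_cost

# A hypothetical row where python filtering is ~1/4 of the total cost:
# split_query_time(4.0, 3.0) == (3.0, 1.0)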
I think it might be time to run this in a nova-scheduler-like
environment: eventlet threads responding to rabbit, using the pymysql backend, etc.
Note we should get quite a bit of concurrency within a single nova-scheduler
process with the db approach.
I suspect clouds that are largely full of pets, pack/fill first, with a smaller
percentage of cattle on top, will benefit the most, as that initial DB filter
will bring back a small list of hosts.
Third, my major concern with the "centralized db claim" design is that it puts
too much scheduling work into the centralized db, and that is not scalable by
simply adding conductors and schedulers.
1. The filtering work is mostly done inside the db by executing complex SQL. If
the filtering logic becomes much more complex (currently only CPU and RAM are
accounted for in the experiment), the db overhead will be considerable.
So, to clarify, only resources we have claims for in the DB will be filtered in
the DB. All other filters will still occur in python.
The good news is that if that turns out to be the wrong trade-off, it's easy to
revert to doing all the filtering in python, with zero impact on the DB
schema.
Another point is that the db filtering recalculates every resource to get
its free value from inventories and allocations each time a schedule
request comes in. This overhead is unnecessary if the scheduler can accept
incremental updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of
the scheduler caches to make sure those updates are handled correctly.
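To illustrate what I mean by strict version control of the cache, a minimal
sketch only (the field names and version scheme are made up, this is not the
proposed implementation):

class NodeCacheEntry(object):
    """Per-compute-node view kept by a scheduler."""
    def __init__(self, free_ram_mb, free_vcpus, version):
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus
        self.version = version          # version of the last applied update

    def apply(self, update):
        """Apply one incremental update; return False if a full refresh is needed."""
        if update.version <= self.version:
            return True                 # duplicate or stale update: ignore it
        if update.version != self.version + 1:
            return False                # gap detected: ask the node for its full state
        self.free_ram_mb += update.ram_mb_delta
        self.free_vcpus += update.vcpus_delta
        self.version = update.version
        return True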
2. Races between centralized claims are resolved by rolling back transactions
and by checking generations (see the design of the "compare and update"
strategy in https://review.openstack.org/#/c/283253/); this also adds overhead
to the db (a sketch of this pattern follows the exchange below).
It's worth noting this pattern is designed to work well with a Galera DB cluster,
including one that has writes going to all the nodes.
I know; my point is that "distributed resource management" with resource
trackers doesn't need db locks or db rollbacks for compute-local resources,
nor the additional overhead, regardless of the type of database.
Well, the transactional "compare-and-update" doesn't need to be done on
the scheduler side, but it will still be needed if we leave the compute
nodes updating their own resources.
To be clear, the fact that the scheduler doesn't need it doesn't mean the
DB models shouldn't have some "compare-and-update" strategy. But I got
your point :-)
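For readers following along, a minimal sketch of the generation-based
"compare and update" pattern point 2 refers to (table and column names are
illustrative, not the actual resource-providers schema from the review above):

import sqlalchemy as sa

_BUMP_GENERATION = sa.text("""
    UPDATE resource_providers
       SET generation = :new_gen
     WHERE id = :rp_id AND generation = :old_gen
""")

class ConcurrentUpdateError(Exception):
    pass

def claim(engine, rp_id, old_gen, write_allocations):
    """One optimistic claim attempt; the caller re-reads and retries on failure."""
    with engine.begin() as conn:            # single transaction
        write_allocations(conn)             # INSERT the allocation rows
        res = conn.execute(_BUMP_GENERATION,
                           {"rp_id": rp_id,
                            "old_gen": old_gen,
                            "new_gen": old_gen + 1})
        if res.rowcount == 0:
            # The generation moved underneath us: raising here rolls the
            # whole transaction (including the allocations) back.
            raise ConcurrentUpdateError()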
3. The db overhead of the filtering operations can be reduced by moving
them to the schedulers; that is 38 times faster and can be executed
in parallel by the schedulers, according to the column "Placement avg query
time". See http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relaxed by using
distributed resource claims in resource trackers. There is no need to roll back
transactions in updating inventories of compute local resources in order to be
accurate. It is confirmed by checking the db records at the end of each run of
eventually consistent scheduler state design.
5. If a) all the filtering operations are done inside the schedulers,
b) the schedulers do not need to refresh their caches from the db because
of incremental updates, and
c) there is no need to do "compare and update" for compute-local
resources (i.e. non-shared resources),
then here is the performance comparison using 1 scheduler
instance: http://paste.openstack.org/show/488717/
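A minimal sketch of the kind of compute-local claim point 4 describes (names
are made up; a real resource tracker keeps much more state):

import threading

class LocalResourceTracker(object):
    """Claims against compute-local resources need only a process-local lock,
    not a db transaction that may have to be rolled back."""
    def __init__(self, free_ram_mb, free_vcpus):
        self._lock = threading.Lock()
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus

    def claim(self, ram_mb, vcpus):
        with self._lock:
            if ram_mb > self.free_ram_mb or vcpus > self.free_vcpus:
                return False    # claim fails; the scheduler retries elsewhere
            self.free_ram_mb -= ram_mb
            self.free_vcpus -= vcpus
            return True         # success; send an incremental update to the schedulers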
The other side of the coin here is sharding.
For example, we could have a dedicated DB cluster for just the scheduler data
(need to add code to support that, but should be possible now, I believe).
Consider if you have three types of hosts that map directly to specific
flavors.
You can shard your scheduler and DB clusters into those groups (i.e. compute
node inventory lives only in one of the shards). When a request comes in, you
just route the appropriate build requests to each of the scheduler clusters.
If you have a large enough deployment, you can shard your hosts across several
DB clusters, and use a modulo or random sharding strategy to pick which cluster
the request lands on. There are issues around ensuring you do capacity planning
that takes those shards into account, but aligning the shards with Availability
Zones, or similar, might stop that being an additional burden.
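To make the modulo idea concrete, a toy example only (the cluster URLs and the
hash key are assumptions, not an existing nova option):

import zlib

DB_CLUSTERS = ["mysql://db0/nova", "mysql://db1/nova", "mysql://db2/nova"]

def cluster_for(compute_node_name):
    """Pin each compute node's inventory to exactly one DB cluster."""
    h = zlib.crc32(compute_node_name.encode("utf-8")) & 0xffffffff
    return DB_CLUSTERS[h % len(DB_CLUSTERS)]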
I think both designs can support sharding. If I understand it correctly, it means the
scheduler will use the "modulo" partition strategy as in the experiment.
As you can see, there is no performance or accuracy regression in the
emulation of my design.
Finally, it is not fair to directly compare the actual abilities of the
resource-provider scheduler and the shared-state scheduler using this
benchmarking tool, because 300 more processes need to be created in order to
simulate the distributed resource management of 300 compute nodes, and there
are no conductors and no MQ in the simulation. But I think it is still useful
to provide at least some statistics.
Really looking forward to a simulator that can test them all in a slightly more
real way (maybe using a fake virt driver, with everything else real?).
I've made good progress on that, almost exactly what you are expecting,
so wait and see :-)
Before drawing any conclusion or going one way or another, I'd love to
see, like John said, a fake implementation using the oslo.messaging fake
driver (so we could simulate the MQ protocol overhead). I'm not sure a fake
virt driver is necessary, since the RT doesn't really use it (just the
figures it gives), so compute workers sending resource consumption
over the fake wire to some scheduler workers should be sufficient to
compare and get better accuracy.
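For illustration, a minimal sketch of such a harness, assuming oslo.messaging's
fake:// transport URL (the topic, method and payload names are made up):

import eventlet
eventlet.monkey_patch()

from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(cfg.CONF, url='fake://')

class SchedulerEndpoint(object):
    """Receives resource consumption updates from simulated compute workers."""
    def update_resource(self, ctxt, node, free_ram_mb):
        print('%s reports %d MB free' % (node, free_ram_mb))

server = oslo_messaging.get_rpc_server(
    transport,
    oslo_messaging.Target(topic='scheduler', server='sched-1'),
    [SchedulerEndpoint()],
    executor='eventlet')
server.start()

# One simulated compute worker casting an update over the fake wire:
client = oslo_messaging.RPCClient(
    transport, oslo_messaging.Target(topic='scheduler'))
client.cast({}, 'update_resource', node='compute-1', free_ram_mb=2048)

eventlet.sleep(0.5)   # give the server a chance to consume the cast
server.stop()
server.wait()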
Thoughts on that?
-Sylvain
Thanks,
John
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev