On 05/03/2015 13:00, Nikola Đipanov wrote:
On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
On 04/03/2015 04:51, Rui Chen wrote:
Hi all,

I want to make it easy to launch a bunch of scheduler processes on a
host; multiple scheduler workers will make use of the host's multiple
processors and improve the performance of nova-scheduler.

I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passes
some test cases, such as booting multiple instances concurrently; so
far we have not found any inconsistency issues.

IMO, nova-scheduler should be easy to scale horizontally, and multiple
workers should be supported as an out-of-the-box feature.

Please feel free to discuss this feature, thanks.

As I said when reviewing your patch, I think the problem is not just
making sure that the scheduler is thread-safe; it's more about how the
Scheduler accounts for resources and provides a retry if the consumed
resources exceed what's available.

Here, the main problem is that two workers can actually consume two
distinct resources on the same HostState object. In that case, the
HostState object is decremented by the amount of resources taken
(setting aside what that means for a resource that is not an
integer...) for both workers, but nowhere in that code path does it
check whether the usage overruns what is available. As I said, it's not
just about adding a semaphore; it's more about rethinking how the
Scheduler manages its resources.
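A toy illustration of the race being described (a hypothetical, heavily simplified HostState; the real nova class is much richer):

```python
class HostState:
    """Hypothetical stand-in for nova's per-host resource view."""
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def consume_from_instance(self, ram_mb):
        # Decrement blindly -- nothing here verifies that usage
        # stays within what is actually available.
        self.free_ram_mb -= ram_mb

host = HostState(free_ram_mb=4096)

# Both workers read the same value and conclude a 3 GB instance fits...
assert host.free_ram_mb >= 3072  # worker A's filter check
assert host.free_ram_mb >= 3072  # worker B's check, same stale value

# ...then both consume, and nothing re-checks the running total.
host.consume_from_instance(3072)
host.consume_from_instance(3072)
print(host.free_ram_mb)  # -2048: oversubscribed, no error raised
```

The two "workers" are sequential here for determinism, but the outcome is the same when two greenthreads interleave the check and the decrement.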


That's why I'm -1 on your patch until [1] gets merged. Once this BP is
implemented, we will have a set of classes for managing heterogeneous
types of resources and consuming them, so it would be quite easy to add
a check against them in the consume_from_instance() method.
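The kind of check being described could look roughly like this (again a hypothetical, simplified HostState; this is not the resource classes from the blueprint):

```python
class HostState:
    """Hypothetical simplified per-host resource view."""
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def consume_from_instance(self, ram_mb):
        # Fail the claim instead of silently going negative, so the
        # caller can fall back to another host or trigger a retry.
        if ram_mb > self.free_ram_mb:
            raise ValueError("insufficient RAM on host")
        self.free_ram_mb -= ram_mb

host = HostState(free_ram_mb=2048)
host.consume_from_instance(1024)
print(host.free_ram_mb)  # 1024

try:
    host.consume_from_instance(2048)  # would overrun; rejected
except ValueError as exc:
    print(exc)  # insufficient RAM on host
```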

I feel that the above explanation does not give the full picture, in
addition to being factually incorrect in several places. I have come to
realize that the current behaviour of the scheduler is subtle enough
that just reading the code is not enough to understand all the edge
cases that can come up. The evidence is that it trips up even people
who have spent significant time working on the code.

It is also important to consider the design choices in terms of the
tradeoffs they were trying to make.

So here are some facts about the way Nova does scheduling of instances
to compute hosts, considering the amount of resources requested by the
flavor (we will try to put the facts into a bigger picture later):

* Scheduler receives a request to choose hosts for one or more instances.
* Upon every request (_not_ for every instance as there may be several
instances in a request) the scheduler learns the state of the resources
on all compute nodes from the central DB. This state may be inaccurate
(meaning out of date).
* Compute resources are updated by each compute host periodically. This
is done by updating the row in the DB.
* The wall-clock time difference between the scheduler deciding to
schedule an instance, and the resource consumption being reflected in
the data the scheduler learns from the DB can be arbitrarily long (due
to load on the compute nodes and latency of message arrival).
* To cope with the above, there is a concept of retrying a request that
fails on a certain compute node because the scheduling decision was made
with data that was stale at the moment of the build; by default we will
retry 3 times before giving up.
* When running multiple instances, decisions are made in a loop, and an
internal in-memory view of the resources gets updated (the widely
misunderstood consume_from_instance method is used for this), so as to
keep subsequent decisions as accurate as possible. As described above,
this is all thrown away once the request is finished.
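The per-request flow in the list above can be sketched as follows (all names and numbers are illustrative; this is not nova's actual API):

```python
def get_all_host_states():
    """Stand-in for the once-per-request DB read (may already be stale)."""
    return [{"host": "node1", "free_ram_mb": 8192},
            {"host": "node2", "free_ram_mb": 4096}]

def schedule_request(instance_ram_mb, num_instances):
    hosts = get_all_host_states()  # one snapshot per request, not per instance
    chosen = []
    for _ in range(num_instances):
        # Pick the host with the most free RAM from the in-memory view.
        best = max(hosts, key=lambda h: h["free_ram_mb"])
        if best["free_ram_mb"] < instance_ram_mb:
            raise RuntimeError("no valid host")
        # The consume_from_instance step: update only the local view;
        # the DB is untouched until the compute host claims the resources.
        best["free_ram_mb"] -= instance_ram_mb
        chosen.append(best["host"])
    return chosen  # the in-memory view is thrown away here

print(schedule_request(instance_ram_mb=2048, num_instances=3))
# ['node1', 'node1', 'node1']
```

Note how the snapshot keeps the three decisions in one request mutually consistent, while saying nothing about what other schedulers or in-flight builds are doing.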

Now that we understand the above, we can start to consider what changes
when we introduce several concurrent scheduler processes.

Several cases come to mind:
* Concurrent requests will no longer be serialized on reading the state
of all hosts (due to how eventlet interacts with mysql driver).
* In the presence of a single request for a large number of instances,
there is going to be a drift in the accuracy of the decisions made by
other schedulers, as they will not have accounted for any of those
instances until they actually get claimed on their respective hosts.
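That drift between concurrent schedulers can be shown with a tiny sketch (hypothetical numbers; each scheduler works from its own copy of the DB snapshot):

```python
db_view = {"node1": 8192}  # free RAM (MB) as currently recorded in the DB

# Scheduler A places 4 x 2 GB instances, updating only its in-memory copy.
a_view = dict(db_view)
for _ in range(4):
    a_view["node1"] -= 2048  # node1 is in fact full now

# Scheduler B starts a request before any of A's instances have been
# claimed on the compute host, so the DB still shows the old number.
b_view = dict(db_view)
print(a_view["node1"])  # 0
print(b_view["node1"])  # 8192 -- B still believes node1 is empty
```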

All of the above limitations will likely not pose a problem under
normal load and usage, but can cause issues to start appearing when
nodes are close to full or when there is heavy load. Also, this changes
drastically based on how we actually choose to utilize hosts (see a
very interesting Ironic bug [1]).

Whether any of the above matters to users depends heavily on their
use-case, though. This is why I feel we should be providing more information.

Finally - I think it is important to accept that the scheduler service
will always have to operate under the assumptions of stale data, and
build for that. Based on that I'd be happy to see real work go into
making multiple schedulers work well enough for most common use-cases
while providing a way forward for people who need tighter bounds on the
feedback loop.

N.

Agreed 100% with everything in your email above. Thanks, Nikola, for taking the time to explain how the Scheduler works; that's (btw.) something I hope to present at the Vancouver Summit if my proposal is accepted.

That said, I hope my reviewers will understand that I would want to see the Scheduler split out into a separate repo first, before working on fixing the race conditions you mentioned above. Yes, I know, it's difficult to accept some limitations in the Nova scheduler when many customers would want them fixed, but here we have so many technical-debt issues that I think we should really work on the split itself (like we did for Kilo and will hopefully continue for Liberty) and then discuss the new design after that.

-Sylvain


[1] https://bugs.launchpad.net/nova/+bug/1341420

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

