On 02/18/2016 07:16 PM, Clint Byrum wrote:
Excerpts from Jay Pipes's message of 2016-02-18 11:33:04 -0800:
I'm talking about the destination host selection process too, but I was
just assuming you'd need compound indexes to make this really efficient,
and I assumed that would mean more indexes than exist today.

Well, it's an *entirely* different schema than exists today... it's kind of tough to compare, on the basis of compound indexes, a new schema (which uses integer sums for all resource amount comparisons) against a schema that uses JSON blobs for some resources, integer fields for others, and entirely separate tables (pci_devices) for yet other resources ;)

So, I guess what I may have missed was that these indexes already exist.

None of them exist in the current database schema. The new schema does have indexes:

CREATE TABLE resource_providers (
  id INT NOT NULL,
  uuid CHAR(36) NOT NULL,
  name VARCHAR(200) NULL,
  can_host INT NOT NULL,
  generation INT NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (uuid)
);

CREATE TABLE inventories (
  resource_provider_id INT NOT NULL,
  resource_class_id INT NOT NULL,
  total INT NOT NULL,
  reserved INT NOT NULL,
  min_unit INT NOT NULL,
  max_unit INT NOT NULL,
  step_size INT NOT NULL,
  allocation_ratio FLOAT NOT NULL,
  PRIMARY KEY (resource_provider_id, resource_class_id),
  INDEX (resource_class_id)
);

CREATE TABLE IF NOT EXISTS allocations (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  resource_provider_id INT NOT NULL,
  resource_class_id INT NOT NULL,
  consumer_uuid CHAR(36) NOT NULL,
  used INT NOT NULL,
  created_at DATETIME NOT NULL,
  INDEX (resource_provider_id, resource_class_id),
  INDEX (consumer_uuid),
  INDEX (resource_class_id, resource_provider_id, used)
);

Lemme know if you spot somewhere that would benefit from alternate indexes or if you disagree with the indexing placed on the above tables.
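
To make the intent of those indexes a bit more concrete, here is a rough
sketch of the kind of capacity query the select-destination step can run
against this schema. It's my own illustration rather than the exact query
in placement-bench, and the :resource_class_id / :requested named
parameters are just placeholders:

SELECT i.resource_provider_id
FROM inventories i
LEFT JOIN (
  SELECT resource_provider_id, SUM(used) AS used
  FROM allocations
  WHERE resource_class_id = :resource_class_id
  GROUP BY resource_provider_id
) AS a ON a.resource_provider_id = i.resource_provider_id
WHERE i.resource_class_id = :resource_class_id
AND ((i.total - i.reserved) * i.allocation_ratio) - COALESCE(a.used, 0)
    >= :requested
AND :requested BETWEEN i.min_unit AND i.max_unit;

-- the (resource_class_id) index on inventories and the
-- (resource_class_id, resource_provider_id, used) index on allocations
-- are what keep this cheap as the number of providers grows

A request spanning multiple resource classes would repeat the above per
class (or add one join per class) and intersect the resulting provider
sets.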

As you would expect, the larger the size of the deployment, the greater
the performance benefit you see using the DB for querying instead of
Python (lower numbers are better here):

DB or Python   # Compute Nodes   Avg Time to Select    Delta
------------------------------------------------------------
DB             100               0.021035
Python         100               0.022517              +7.0%
DB             200               0.023370
Python         200               0.026526             +13.5%
DB             400               0.027638
Python         400               0.034666             +25.4%
DB             800               0.034814
Python         800               0.048271             +38.6%

The above was for a serialized scenario (1 scheduler process). Results for
parallel operation at 2, 4 and 8 scheduler processes were virtually
identical, as can be expected, since this is testing read performance,
not write performance.

I am not surprised at these results at all. However, I am still a little
wary of anything that happens faster in a central resource. Faster
is great, but it also means we now have to scale _up_ that central
resource. Hopefully it is so much more efficient to read indexes from
that DB than to filter lists in Python that we get a very large margin
between what lots of slow Python processes could have done and what one
very fast mysqld can do.

Sure, I understand your concerns. I built the placement-bench project precisely to get data to inform us of the benefits and drawbacks of different approaches. Hopefully that data will allow us to make good decisions in the future.

> With 1000 active compute nodes updating their status, each index added
> will be 1000 more index writes per update period. Still a net win, but
> I'm always cautious about shifting things to more writes on the database
> server. That said, I do think it will be a win and should be done.

Again, this isn't what the "move the filtering to the database query"
proposal is about :) You are describing the *claim* operation above, not
the select-destination operation.

The *current* scheduler design is what has each distributed compute node
sending updates to the scheduler^Wdatabase each time a claim occurs.
What the second part of my proposal does is move the claim from the
distributed compute nodes and into the scheduler, which should allow the
scheduler to operate on non-stale data (which will reduce the number of
long retry operations). More below.

The second major scale problem with the current Nova scheduler design
has to do with the fact that the scheduler does *not* actually claim
resources on a provider. Instead, the scheduler selects a destination
host to place the instance on and the Nova conductor then sends a
message to that target host which attempts to spawn the instance on its
hypervisor. If the spawn succeeds, the target compute host updates the
Nova database and decrements its count of available resources. These
steps (from nova-scheduler to nova-conductor to nova-compute to
database) all take a non-trivial amount of time. During this
time window, a different scheduler process may pick the exact same
target host for a like-sized launch request. If there is only room on
the target host for one of those size requests [5], one of those spawn
requests will fail and trigger a retry operation. This retry operation
will attempt to repeat the scheduler placement decisions (by calling
select_destinations()).

This retry operation is relatively expensive, and needlessly so: if the
scheduler claimed the resources on the target host before sending its
pick back to the conductor, then the chances of producing a retry would
be almost eliminated [6]. The resource-providers-scheduler blueprint
attempts to remedy this second scaling design problem by having the
scheduler write records to the allocations table before sending the
selected target host back to the Nova conductor.
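
Concretely, the claim itself is just a handful of rows written to the
allocations table, one per resource class the instance consumes. A minimal
sketch, with made-up parameter names (the real code is in the
placement-bench links further down):

BEGIN;

INSERT INTO allocations
  (resource_provider_id, resource_class_id, consumer_uuid, used, created_at)
VALUES
  (:rp_id, :vcpu_class_id, :instance_uuid, :vcpus, NOW()),
  (:rp_id, :memory_class_id, :instance_uuid, :memory_mb, NOW()),
  (:rp_id, :disk_class_id, :instance_uuid, :root_gb, NOW());

-- whether this transaction commits or rolls back depends on the
-- provider-generation check sketched below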

*This*, to me, is the thing that makes the scheduler dramatically more
scalable. The ability to run as many schedulers as I expect to need to
respond to user requests in a reasonable amount of time is the key to
victory here.

However, I wonder how you will avoid serialization or getting into
a much tighter retry race for the claiming operations. There's talk
in the spec of inserting allocations in a table atomically. However,
with multiple schedulers, you'll still have the problem where one will
claim and the others will need to know that they cannot.

This is handled in my proposal with a single database transaction that
looks at a "generation" column on each resource provider and rolls back
the transaction if the generation is not the same as what was read
during the select-destination process.
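
As a sketch of that guard (my paraphrase of the code linked below, with
illustrative parameter names): the allocation inserts and a generation
bump run in the same transaction, and a zero row count on the bump means
another scheduler modified the provider in the meantime:

UPDATE resource_providers
SET generation = generation + 1
WHERE id = :rp_id
AND generation = :generation_seen_at_select_time;

-- 1 row affected: nothing changed under us, COMMIT the allocation rows
-- 0 rows affected: another scheduler won the race; ROLLBACK and retry the
--   placement against freshly read inventory and allocation data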

> We can talk about nuts and bolts, but there are really only two ways
> this can work: exclusive locking, or compare and swap retry loops.

Yup. Compare and swap is what I propose and have implemented in the
placement-bench project here:

https://github.com/jaypipes/placement-bench/blob/master/placement.py#L123-L129

triggering a retry here:

https://github.com/jaypipes/placement-bench/blob/master/placement.py#L212-L217

Exclusive locking -- i.e. SELECT FOR UPDATE -- won't work on Galera
systems in multi-writer mode, as you already know :)


I would actually disagree here. It can totally work, and in fact, Galera
is basically doing _exactly_ what you describe with the generation
column, inside its own mechanisms. It's just using a rather obtuse way
of signalling to you that the generations got out of sync: by saying
you had a deadlock and automatically rolling back when you thought you
wanted to commit.

They're both very similar in mechanism, but one is buried deep in
Galera, and one is easier to read and has the benefit of being an
explicit approach.

In my initial benchmarks, I have found that this compare and swap
approach works OK at scale (higher numbers are better here):

# Compute Nodes   Successful claims per second
100               54.1
200               68.9
400               51.3
800               34.3

All of the above numbers are for 8 scheduler processes, using a
pack-first placement strategy and using no partitioning strategy (so,
pretty much worst-case scenario).

Using a simple modulo partitioning strategy but staying with the
pack-first placement strategy, I got much better results:

# Compute Nodes   Successful claims per second
100               97.1
200               124.5
400               115.1
800               89.4

That's about 50 times better than what I saw on Kilo with 2 schedulers
and 1000 simulated nodes, so huzzah!
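
(Roughly speaking, the modulo partitioning strategy just has each of the
N scheduler workers consider only the providers whose id maps to its
worker number -- something along these lines added to the selection
query's WHERE clause, with :num_workers and :worker_index as illustrative
placeholders:

AND (i.resource_provider_id % :num_workers) = :worker_index

Each worker then claims from its own slice of providers, so concurrent
workers rarely contend for the same resource_providers.generation row.)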

:)

Best,
-jay

