On Oct 7, 2015, at 6:00 PM, Chris Friesen <[email protected]> wrote:

> I've wondered for a while (ever since I looked at the scheduler code, really) 
> why we couldn't implement more of the scheduler as database transactions.
> 
> I haven't used Cassandra, so maybe you can clarify something about updates 
> across a distributed DB.  I just read up on lightweight transactions, and it 
> says that they're restricted to a single partition.  Is that an acceptable 
> limitation for this usage?

That's an implementation detail. A partition is defined by the partition key, 
not by any physical arrangement of nodes. The partition key would have to depend 
on the resource type, plus whatever other columns would make such a query unique.
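
For illustration only (the keyspace, table, and column names here are 
placeholders I'm making up, not a worked-out design), a compound partition key 
on resource type and host would keep each per-host claim inside a single 
partition:

    # Minimal sketch using the DataStax Python driver; assumes a
    # Cassandra node on localhost and an existing 'scheduler' keyspace.
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('scheduler')
    session.execute("""
        CREATE TABLE IF NOT EXISTS host_resources (
            resource_type text,   -- e.g. 'ram', 'vcpu', 'pci'
            host_id       text,
            total         int,
            used          int,
            PRIMARY KEY ((resource_type, host_id))
        )
    """)

Every lightweight transaction against a table like that touches exactly one 
(resource_type, host_id) partition, so the single-partition restriction 
wouldn't bite in practice.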

> Some points that might warrant further discussion:
> 
> 1) Some resources (RAM) only require tracking amounts.  Other resources 
> (CPUs, PCI devices) require tracking allocation of specific individual host 
> resources (for CPU pinning, PCI device allocation, etc.).  Presumably for the 
> latter we would have to actually do the allocation of resources at the time 
> of the scheduling operation in order to update the database with the claimed 
> resources in a race-free way.

Yes, that's correct. A lot of thought would have to be put into how best to 
represent these different types of resources. I have ideas about that, but I 
would feel a whole lot better defining them only after talking these concepts 
over with others who understand the underlying resources better than I do.
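
Purely as a strawman (none of these names are settled), the two kinds of 
resources might call for different column types: plain integers for consumable 
amounts like RAM, and sets for discrete items like pinnable cores or PCI 
addresses, where a claim removes a specific element:

    # Strawman only -- hypothetical table and column names.
    session.execute("""
        CREATE TABLE IF NOT EXISTS host_inventory (
            host_id    text PRIMARY KEY,
            ram_total  int,
            ram_used   int,
            free_cpus  set<int>,   -- pinnable cores still available
            free_pci   set<text>   -- PCI addresses still available
        )
    """)
    # Claiming pinned CPU 2 removes that specific element, guarded by
    # a compare-and-set on the set's previous value:
    session.execute("""
        UPDATE host_inventory
        SET free_cpus = free_cpus - {2}
        WHERE host_id = %s
        IF free_cpus = %s
    """, ('compute-01', {2, 3, 6, 7}))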

> 2) Are you suggesting that all of nova switch to Cassandra, or just the 
> scheduler and resource tracking portions?  If the latter, how would we handle 
> things like pinned CPUs and PCI devices that are currently associated with 
> specific instances in the nova DB?

I am only thinking of the scheduler as a separate service. Perhaps Nova as a 
whole might benefit from switching to Cassandra for its database needs, but I 
haven't really thought about that at all.

> 3) The concept of the compute node updating the DB when things change is 
> really orthogonal to the new scheduling model.  The current scheduling model 
> would benefit from that as well.

Actually, it isn't that different. Compute nodes already send updates to the 
scheduler when instances are created/deleted/resized/etc., so this isn't much 
of a stretch.
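
In this model the compute node's update path would just become a write to the 
shared table whenever its local picture changes; a sketch against the same 
hypothetical schema:

    # Sketch: a compute node reporting its current usage after an
    # instance is created or deleted (hypothetical names throughout).
    def report_usage(session, resource_type, host_id, used):
        session.execute(
            "UPDATE host_resources SET used = %s "
            "WHERE resource_type = %s AND host_id = %s",
            (used, resource_type, host_id))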

> 4) It seems to me that to avoid races we need to do one of the following.  
> Which are you proposing?
> a) Serialize the entire scheduling operation so that only one instance can 
> schedule at once.
> b) Make the evaluation of filters and claiming of resources a single atomic 
> DB transaction.
> c) Do a loop where we evaluate the filters, pick a destination, try to claim 
> the resources in the DB, and retry the whole thing if the resources have 
> already been claimed.

Probably a combination of b) and c). Filters would, for lack of a better term, 
add CQL WHERE clauses to the query, which would return a set of acceptable 
hosts. Weighers would order those hosts by desirability, and then the claim 
would be attempted. If the claim failed because the host had changed, the next 
acceptable host would be selected, and so on. I don't imagine that "retrying 
the whole thing" would be an efficient option unless the original filtering 
query returned no other acceptable hosts.
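
A minimal sketch of that select-then-claim loop, again using the hypothetical 
host_resources table and the DataStax Python driver (a real implementation 
would fold the filters into the SELECT as described above):

    def schedule(session, resource_type, amount, weighed_hosts):
        """Try hosts in weighed order; claim with a lightweight
        transaction, moving to the next host if we lose the race."""
        claim = session.prepare("""
            UPDATE host_resources SET used = ?
            WHERE resource_type = ? AND host_id = ?
            IF used = ?
        """)
        for host in weighed_hosts:    # already filtered and ordered
            row = session.execute(
                "SELECT used, total FROM host_resources "
                "WHERE resource_type = %s AND host_id = %s",
                (resource_type, host)).one()
            if row is None or row.used + amount > row.total:
                continue
            # The IF clause makes this a compare-and-set: it only
            # applies if 'used' still holds the value we just read.
            result = session.execute(
                claim, (row.used + amount, resource_type, host, row.used))
            if result.was_applied:
                return host           # claim succeeded
            # Another scheduler won this host; fall through to the next.
        return None                   # nothing left to claim

The losing process sees was_applied come back False and simply moves on to its 
second choice, which is exactly the behavior described in the next paragraph.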

Put another way: if we are in a racy situation, and two scheduler processes are 
trying to place a similar instance, both processes would most likely come up 
with the same set of hosts ordered in the same way. One of those processes 
would "win", and claim the first choice. The other would fail the transaction, 
and would then claim the second choice on the list. IMO, this is how you best 
deal with race conditions.


-- Ed Leafe




