Ian, your suggestion of retrieving only the changes since a timestamp is a good 
one.  When a scheduler first comes online (in an HA context), it would request 
compute node status with a null timestamp to retrieve everything.

It also paves the way for a full in-memory record of all compute node status, 
because it requires each scheduler to keep its own copy of that status.

The scheduler could retrieve status every second, or whenever it gets a new VM 
request. Under heavy load - that is, frequent requests - the timestamps would 
be closer together and hopefully fewer changes would be returned each time. We 
may want to make the polling frequency a configurable item so it can be tuned: 
too infrequent means a large payload (though no worse than today's full load), 
while too frequent means most polls may return nothing. A rough sketch of such 
an option follows.
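
As an illustration only (the option name and default below are hypothetical, 
not an existing Nova setting), the polling interval could be exposed through 
oslo.config along these lines:

    # Hypothetical option -- not an existing Nova configuration setting.
    from oslo.config import cfg

    poll_opts = [
        cfg.IntOpt('scheduler_status_poll_interval',
                   default=1,
                   help='Seconds between scheduler polls for compute node '
                        'status changes. Lower values mean smaller, more '
                        'frequent payloads; higher values mean larger but '
                        'rarer ones.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(poll_opts)

The scheduler would then poll on that interval, and additionally whenever a 
new VM request arrives.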

Regards
Malini


-----Original Message-----
From: Ian Wells [mailto:[email protected]] 
Sent: Monday, July 22, 2013 1:56 PM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] The PCI support blueprint

On 22 July 2013 21:08, Boris Pavlovic <[email protected]> wrote:
> Ian,
>
> I don't like to write anything personally.
> But I have to write some facts:
>
> 1) I see tons of hands and only 2 solutions: mine and one more that is 
> based on code.
> 2) My code was published before session (18. Apr 2013)
> 3) Blueprints from summit were published (03. Mar 2013)
> 4) My Blueprints were published (25. May 2013)
> 5) Patches based on my patch were published only (5. Jul 2013)

Absolutely.  Your patch and our organisation crossed in the mail, and everyone 
held off work on this because you were working on this.
That's perfectly normal, just unfortunate, and I'm grateful for your work on 
this, not pointing fingers.

> After making investigations and tests, we found that one of the reasons 
> the scheduler works slowly and has scalability problems is its work with the DB.
> JOINs are a pretty unscalable and slow thing, and if we add one more JOIN, 
> which is required by PCI passthrough, we will get a much worse situation.

Your current PCI passthrough design adds a new database that stores every PCI 
device in the cluster, and you're thinking of crossing that with the compute 
node and its friends.  That's certainly unscalable.

I think the issue here is, in fact, more that you're storing every PCI device.  
The scheduler doesn't care.  In most cases, devices are equivalent, so instead 
of storing 1024 devices you can store one single row in the stats table saying 
pci_device_class_networkcard = 1024.  There may be a handful of these classes, 
but there won't be
1024 of them per cluster node.  The compute node can take any one of the PCI 
devices in that class and use it - the scheduler should neither know nor care.

This drastically reduces the transfer of information from the compute node to 
the scheduler host and also reduces the amount of data you need to store in the 
database - and the scheduler DB doesn't need changing at all.
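
As a minimal sketch of that idea - the per-class key naming and the helper 
below are purely illustrative, not the blueprint's actual code:

    # Hypothetical sketch: report PCI devices as per-class counts in the
    # existing compute node stats, rather than one DB row per device.
    def pci_stats_for_host(pci_devices):
        """Collapse a host's PCI device list into per-class counts."""
        stats = {}
        for dev in pci_devices:
            key = 'pci_device_class_%s' % dev['device_class']
            stats[key] = stats.get(key, 0) + 1
        return stats

    # e.g. 1024 equivalent NICs on a host become a single stats entry:
    #   {'pci_device_class_networkcard': 1024}
    # The scheduler only compares the count against the request; the
    # compute node picks whichever concrete device it likes at plug time.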

This seems like a much lower-impact approach for now - it doesn't change the 
database at all and doesn't add much to the scheduling problem (indeed, no 
overhead at all for non-PCI users) until we solve the scalability issues 
you're talking about at some later date.

For what it's worth, one way of doing that without a drastic database redesign 
would be to pass compute_node_get_all a timestamp, return only stats updated 
since that timestamp, return a new timestamp, and merge that in with what the 
scheduler already knows about.  There's some refinement needed - since 
timestamps are not reliable clocks in databases - but it reduces the flow of 
data from the DB substantially and works with an eventually consistent 
system.
(Truthfully, I prefer your in-memory-store idea, there's nothing about these 
stats that really needs to survive a reboot of the control node, but this might 
be a quick fix.)
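
A minimal sketch of that timestamp-and-merge shape, purely illustrative - 
compute_node_get_all_since() does not exist in the Nova DB API today, and the 
merge key is made up:

    # Hypothetical sketch of a scheduler-side cache that polls for
    # changes and merges them into an in-memory view of all hosts.
    class HostStateCache(object):
        def __init__(self):
            self.host_states = {}   # in-memory view of all compute nodes
            self.last_stamp = None  # None => 'give me everything'

        def refresh(self, context):
            # Ask the DB only for rows updated since the last poll; the
            # DB hands back the timestamp to use next time, so we rely
            # on its clock rather than the scheduler's.
            updated, new_stamp = compute_node_get_all_since(
                context, self.last_stamp)
            for node in updated:
                # Merge changed rows into what the scheduler already knows.
                self.host_states[node['hypervisor_hostname']] = node
            self.last_stamp = new_stamp

On first start (or HA failover) last_stamp is None, so the first call pulls 
everything and subsequent calls only pull deltas.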
--
Ian.

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
