We had two design summit sessions for cells v2 at the Ocata summit in Barcelona. The full etherpads are here:

https://etherpad.openstack.org/p/ocata-nova-summit-cellsv2-scheduler
https://etherpad.openstack.org/p/ocata-nova-summit-cellsv2-quotas

In the first session we mostly talked about the work items needed to support multiple cells.

Scheduler interaction
---------------------

Andrew Laski had the spec up for this and some POC code was started, which Dan Smith is picking up. The main idea is that nova-api creates the BuildRequest and then calls a new conductor method; conductor calls the scheduler to pick a host, creates the instance in the cell mapped to that host, and finally deletes the BuildRequest. Once the instance is in that cell, it doesn't move out of it via reschedule/rebuild/migration/evacuate. We might add support for that later, but it's not in scope for the initial cells v2 scheduler work.
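
Roughly, the flow is something like this, with made-up stand-ins for the real nova objects and RPC calls (BuildRequest aside, none of these names are the actual nova interfaces):

    class BuildRequest(object):
        """Stand-in for the BuildRequest row in the API database."""
        def __init__(self, instance_uuid, request_spec):
            self.instance_uuid = instance_uuid
            self.request_spec = request_spec

    def pick_host(request_spec):
        """Stand-in for the scheduler picking a host."""
        return 'compute1'

    # Host mappings in the API database tell conductor which cell the
    # chosen host lives in.
    HOST_TO_CELL = {'compute1': 'cell1'}

    def schedule_and_build(api_db, cell_dbs, build_request):
        # 1. Conductor asks the scheduler for a host.
        host = pick_host(build_request.request_spec)
        # 2. Conductor creates the instance in the cell mapped to that
        #    host; the instance stays in this cell from here on.
        cell_dbs[HOST_TO_CELL[host]].append(
            {'uuid': build_request.instance_uuid, 'host': host})
        # 3. Conductor deletes the BuildRequest from the API database.
        api_db.remove(build_request)

    # nova-api creates the BuildRequest and hands off to conductor.
    api_db, cell_dbs = [], {'cell1': []}
    br = BuildRequest('uuid-1', {'flavor': 'm1.small'})
    api_db.append(br)
    schedule_and_build(api_db, cell_dbs, br)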

Cell0
-----

Cell0 is the special database (using the same cell DB schema) where instances that failed to build go to die. In Newton cell0 was optional, but the API will pull instances from it when listing instances. However, we aren't populating it on scheduling failure yet, so that's work that needs to be done for Ocata. There are no patches up for this yet, but it will most likely be worked in with the scheduler interaction series.
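
To illustrate the idea (again with made-up names, not actual nova code), a NoValidHost during scheduling would "bury" the instance in cell0 in an error state instead of creating it in a normal cell:

    class NoValidHost(Exception):
        """Mirrors the scheduler exception of the same name."""

    def pick_host(hosts):
        if not hosts:
            raise NoValidHost()
        return hosts[0]

    def schedule_and_build(cell_dbs, request, hosts):
        try:
            host = pick_host(hosts)
        except NoValidHost:
            # Nothing fits: bury the instance in cell0 in ERROR state
            # so the API can still show it when listing instances.
            cell_dbs['cell0'].append({'uuid': request['uuid'],
                                      'vm_state': 'error'})
            return
        # ...otherwise, the normal create-in-cell flow from above.
        cell_dbs['cell1'].append({'uuid': request['uuid'], 'host': host})

    cell_dbs = {'cell0': [], 'cell1': []}
    schedule_and_build(cell_dbs, {'uuid': 'uuid-2'}, hosts=[])
    print(cell_dbs['cell0'])  # one ERROR instance buried in cell0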

The listing instances problem
-----------------------------

In order to list instances we need to be able to page across cells, which is going to be expensive for multiple large cells. Long-term we want to use searchlight for this, but in the short term we're going to do a simple merge sort in python. We also need to get the fixes in to restrict the filter/sort keys for listing instances; Alex Xu and Kevin Zheng are working on that. There are open questions about how the long-term plans with searchlight are going to work, and Chris Dent said he was interested in working on that. As a start, the searchlight team has started reporting bugs against nova for gaps between the notifications nova sends out and the REST API. Balazs Gibizer (gibi), who runs the notifications subteam in nova, has already started triaging those bugs.
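
The short-term merge sort is essentially an n-way merge of per-cell sorted results; a minimal sketch with heapq (assuming each cell returns rows sorted on the same key; the key argument to heapq.merge needs python >= 3.5):

    import heapq
    from itertools import islice
    from operator import itemgetter

    # Each cell DB returns its own results already sorted on the sort key.
    cell1 = [{'name': 'a', 'created_at': 1},
             {'name': 'c', 'created_at': 5}]
    cell2 = [{'name': 'b', 'created_at': 2},
             {'name': 'd', 'created_at': 9}]

    # n-way merge without concatenating and re-sorting everything.
    merged = heapq.merge(cell1, cell2, key=itemgetter('created_at'))

    # Stop once the requested page (limit) is full; the expense is that
    # each cell still has to produce up to `limit` rows per request.
    page = list(islice(merged, 3))
    print([inst['name'] for inst in page])  # ['a', 'b', 'c']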

Testing
-------

CI testing the multi-cell configuration should be relatively straightforward with a multinode job. We can have one node contain the 'control' bits (the API, conductor, scheduler, and the API/cell0/cell DBs) along with a nova-compute service, and then another node that is just running nova-compute. I've volunteered to work on that.

Dan and I have also started working on a series of changes to make cellsv2 required in the Ocata CI jobs:

https://review.openstack.org/#/q/topic:ocata-requires-cellsv2

This consists of a nova database migration that fails if you haven't created the cell0 database and run the simple_cell_setup command.
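
The shape of that blocker migration is roughly this (an illustrative sketch, not the actual patch): it makes no schema change itself, it just refuses to run until the cells v2 setup has been done:

    class CellsV2SetupRequired(Exception):
        pass

    def upgrade(cell_mappings):
        """Stand-in for a sqlalchemy-migrate upgrade() entry point.

        The real migration would query the API database for a cell0
        mapping; here that's just a list of mapping dicts.
        """
        if not any(m.get('name') == 'cell0' for m in cell_mappings):
            raise CellsV2SetupRequired(
                'Create the cell0 database and run the '
                'simple_cell_setup command before running further '
                'nova database migrations.')

    upgrade([{'name': 'cell0'}])   # set up: the migration proceeds
    # upgrade([])                  # not set up: raises and blocks the upgrade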

Upgrade
-------

Today in non-cells deployments we say to upgrade conductors first, then the API, and finally the computes. With cells v2 we want to do rolling upgrades of the cells, which means upgrading conductors and then computes within the cells, and the API last. This allows you to only enable the latest features in the API once the cells are all upgraded and ready to handle those requests. This poses a bit of a problem for our CI tooling with devstack/grenade, though, since we currently upgrade and start nova-api first. That's going to require some changes as we get into the multi-cell testing mentioned above.

Quotas
------

In the second design summit session on cells v2 we spent the entire time talking about quotas and thinking about ways to potentially redo the quotas design when moving them to the API database.

Melanie Witt has a spec up which proposes that, instead of the current reserve (API), do work, commit/rollback (compute) model, we move commits to the API and have a process of (1) do the work and then (2) reserve and commit.
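
In rough pseudo-python (toy names, not the spec's actual interfaces), the difference in ordering looks like:

    class OverQuota(Exception):
        pass

    class Quota(object):
        """Toy quota driver, just to show the two orderings."""
        def __init__(self, limit):
            self.limit = limit
            self.in_use = 0
            self.reserved = 0

        def reserve(self, n):
            if self.in_use + self.reserved + n > self.limit:
                raise OverQuota()
            self.reserved += n
            return n

        def commit(self, n):
            self.reserved -= n
            self.in_use += n

        def rollback(self, n):
            self.reserved -= n

        def reserve_and_commit(self, n):
            if self.in_use + n > self.limit:
                raise OverQuota()
            self.in_use += n

    def boot_today(quota, build):
        # reserve (API) -> do work (compute) -> commit/rollback (compute);
        # a rollback that never happens is one way usage gets out of sync.
        r = quota.reserve(1)
        try:
            build()
        except Exception:
            quota.rollback(r)
            raise
        quota.commit(r)

    def boot_proposed(quota, build):
        # (1) do the work, then (2) reserve and commit, both in the API.
        build()
        quota.reserve_and_commit(1)

    q = Quota(limit=10)
    boot_proposed(q, build=lambda: None)
    print(q.in_use)  # 1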

While talking about that in the session, there was some brainstorming on doing limit checks differently in the API: do away with reservations entirely, do a quick DB query to check quota before an operation begins, and if it's OK go forward. The upside is that usage shouldn't get out of sync (which has been a problem for operators to deal with), but there is a potential race for the last few usages as tenants approach their limits, which could lead to overconsumption of resources. Maybe that's an acceptable trade-off.
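
A minimal sketch of that check-based approach (illustrative only):

    class OverQuota(Exception):
        pass

    def check_quota(count_in_use, limit, requested=1):
        # One quick query (e.g. a SELECT COUNT(*) for the project) and
        # a comparison; no reservation rows to get out of sync.
        in_use = count_in_use()
        if in_use + requested > limit:
            raise OverQuota('%d in use + %d requested > limit %d'
                            % (in_use, requested, limit))

    instances = []
    check_quota(lambda: len(instances), limit=10)
    instances.append('uuid-1')
    # Two requests that both run check_quota() before either creates
    # its instance can both pass at 9/10 in use -- that's the race on
    # the last usages mentioned above.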

There are some open questions around this 'fast and loose' approach, like: are there things we need to count which aren't in the API database, and can we do this efficiently for things that nova doesn't track in its database, like floating/fixed IPs in neutron?

--

Thanks,

Matt Riedemann

