Jesse J. Cook
Compute Team Lead
[email protected]
irc: #compute-eng (gimchi)
mobile: 618-530-0659
<https://www.linkedin.com/pub/jesse-cook/8/292/620>
<https://plus.google.com/u/0/+JesseCooks/posts/p/pub>
On 7/20/15, 12:40 PM, "Clint Byrum" <[email protected]> wrote:

>Excerpts from Jesse Cook's message of 2015-07-20 07:48:46 -0700:
>>
>> On 7/15/15, 9:18 AM, "Ed Leafe" <[email protected]> wrote:
>>
>> >-----BEGIN PGP SIGNED MESSAGE-----
>> >Hash: SHA512
>> >
>> >Changing the architecture of a complex system such as Nova is never
>> >easy, even when we know that the design isn't working as well as we
>> >need it to. And it's even more frustrating because when the change
>> >is complete, it's hard to know if the improvement, if any, was
>> >worth it.
>> >
>> >So I had an idea: what if we ran a test of that architecture change
>> >out-of-tree? In other words, create a separate deployment, and rip
>> >out the parts that don't work well, replacing them with an
>> >alternative design. There would be no Gerrit reviews or anything
>> >else that would slow down the work or add load to the already
>> >overloaded reviewers. Then we could see if this modified system is
>> >a significant-enough improvement to justify investing the time in
>> >implementing it in-tree. And, of course, if the test doesn't show
>> >what was hoped for, it is scrapped and we start thinking anew.
>>
>> +1
>>
>> >The important part in this process is defining up front what level
>> >of improvement would be needed to make such a change worth
>> >considering, and what sort of tests would demonstrate whether or
>> >not this level was met. I'd like to discuss such an experiment next
>> >week at the Nova mid-cycle.
>> >
>> >What I'd like to investigate is replacing the current design, in
>> >which the compute nodes communicate with the scheduler via message
>> >queues. This design is overly complex and has several known
>> >scalability issues. My thought is to replace this with a Cassandra
>> >[1] backend. Compute nodes would update their state in Cassandra
>> >whenever it changes, and that data would be read by the scheduler
>> >to make its host selection.
>> >When the scheduler chooses a host, it would post the claim to
>> >Cassandra wrapped in a lightweight transaction, which would ensure
>> >that no other scheduler has tried to claim those resources. When
>> >the host has built the requested VM, it will delete the claim and
>> >update Cassandra with its current state.
>> >
>> >One main motivation for using Cassandra over the current design is
>> >that it will enable us to run multiple schedulers without
>> >increasing the raciness of the system. Another is that it will
>> >greatly simplify a lot of the internal plumbing we've set up to
>> >implement in Nova what we would get out of the box with Cassandra.
>> >A third is that if this proves to be a success, it could also be
>> >used further down the road to simplify inter-cell communication
>> >(but this is getting ahead of ourselves...). I've worked with
>> >Cassandra before, and it has been rock-solid to run and simple to
>> >set up. I've also had preliminary technical reviews with the
>> >engineers at DataStax [2], the company behind Cassandra, and they
>> >agreed that this was a good fit.
>> >
>> >At this point I'm sure that most of you are filled with thoughts on
>> >how this won't work, or how much trouble it will be to switch, or
>> >how much more of a pain it will be, or how you hate non-relational
>> >DBs, or any of a zillion other negative thoughts. FWIW, I have them
>> >too. But instead of ranting, I would ask that we acknowledge for
>> >now that:
>>
>> Call me an optimist, but I think this can work :)
>>
>> I would prefer a solution that avoids state management altogether
>> and instead depends on each individual making rule-based decisions
>> using their limited observations of their perceived environment. Of
>> course, this has certain emergent behaviors you have to learn from,
>> but on the upside, no more braiding state throughout the system.
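For what it's worth, the claim flow described above boils down to a
compare-and-set write (Cassandra's lightweight transactions, i.e. CQL's
INSERT ... IF NOT EXISTS). Here is a minimal in-memory sketch of just
those semantics, with no Cassandra involved; all names here are
hypothetical:

```python
# Hypothetical in-memory stand-in for a Cassandra claims table. The lock
# models the atomicity a lightweight transaction would provide: only one
# scheduler's claim on a given (host, instance) slot can ever succeed.
import threading

class ClaimStore:
    def __init__(self):
        self._claims = {}              # (host, instance_id) -> resources
        self._lock = threading.Lock()  # stands in for LWT atomicity

    def try_claim(self, host, instance_id, resources):
        """Atomically insert a claim; False if one already exists."""
        with self._lock:
            key = (host, instance_id)
            if key in self._claims:
                return False           # another scheduler won the race
            self._claims[key] = resources
            return True

    def release(self, host, instance_id):
        """The host deletes the claim once the VM is built."""
        self._claims.pop((host, instance_id), None)

store = ClaimStore()
# Two schedulers race to claim the same slot; exactly one wins.
first = store.try_claim("compute-01", "inst-a", {"vcpus": 2, "ram_mb": 4096})
second = store.try_claim("compute-01", "inst-a", {"vcpus": 2, "ram_mb": 4096})
print(first, second)   # -> True False
```

The point is only that the "did someone else already claim this?" check
and the insert happen as one atomic step, which is what lets multiple
schedulers run without adding raciness.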
>> I don't like the assumption that it has to be a global state
>> management problem when it doesn't have to be. That being said, I'm
>> not opposed to trying a solution like you described using Cassandra
>> or something similar. I generally support improvements :)
>>
>> >a) it will be disruptive and painful to switch something like this
>> >at this point in Nova's development
>> >b) it would have to provide *significant* improvement to make such
>> >a change worthwhile
>> >
>> >So what I'm asking from all of you is to help define the second
>> >part: what we would want improved, and how to measure those
>> >benefits. In other words, what results would you have to see in
>> >order to make you reconsider your initial "nah, this'll never work"
>> >reaction, and start to think that this will be a worthwhile change
>> >to make to Nova.
>>
>> I'd like to see n build requests within 1 second each be
>> successfully scheduled to a host that has spare capacity, with a
>> total system capacity of only n * 1.10, where n >= 10000, each cell
>> has ~100 hosts, the number of hosts is >= n * 0.10 and <= n * 0.90,
>> and the number of schedulers is >= 2.
>>
>> For example:
>>
>> Build requests: 10000 in 1 second
>> Slots for flavor requested: 11000
>> Hosts that can build flavor: 7500
>> Number of schedulers: 3
>> Number of cells: 75 (each with 100 hosts)
>
>This is right on, though one thing missing is where the current code
>fails this test. It would be great to have the numbers above available
>as a baseline so we can denote progress in any experiment.

The cell-level scheduling code over-schedules to cells and cannot
retry.

>Also, I'm a little confused why you'd want cells still, but perhaps
>the idea is to get the scale of one cell so high that you don't
>actually ever want cells, since at that point you should really be
>building new regions?

Cells are just another horizontal scaling construct.
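Stepping back to the acceptance criteria above for a moment: the example
numbers can be sanity-checked mechanically. The helper below is
hypothetical and just encodes the stated constraints in integer
arithmetic:

```python
# Check an example scenario against the acceptance criteria: n >= 10000
# build requests, capacity at most n * 1.10, hosts between 10% and 90%
# of n, at least 2 schedulers, and ~100 hosts per cell.
def meets_criteria(n, slots, hosts, schedulers, cells, hosts_per_cell=100):
    return (
        n >= 10000                       # at least 10k requests in 1 second
        and n <= slots <= n * 11 // 10   # capacity headroom of at most 10%
        and n <= hosts * 10 <= 9 * n     # hosts within [0.10 * n, 0.90 * n]
        and schedulers >= 2
        and cells * hosts_per_cell == hosts
    )

# 10000 requests/s, 11000 slots, 7500 hosts, 3 schedulers, 75 cells:
print(meets_criteria(10000, 11000, 7500, 3, 75))   # -> True
```

The tight 10% headroom is what makes this hard: with slack capacity,
almost any scheduler passes; near full, retries and races dominate.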
There are good use cases for them, especially in a world of horizontal
scaling. For example, operators standing up more servers in a region
to go online at some point in the near future.

>To your earlier point about state being abused in the system, I
>totally 100% agree. In the past I've wondered a lot if there can be a
>worker model, where compute hosts all try to grab work off queues if
>they have available resources. So API requests for boot/delete don't
>change any state, they just enqueue a message. Queues would be matched
>up to resources, and the more filter choices, the more queues. Each
>time a compute node completed a task (create vm, destroy vm) it would
>re-evaluate all of the queues and subscribe to the ones it could
>satisfy right now. Quotas would simply be the first stop for the
>enqueued create messages, and a final stop for the enqueued delete
>messages (once it's done, release quota). If you haven't noticed, this
>would agree with Robert Collins's suggestion that something like Kafka
>is a technology more suited to this (or my favorite old,
>often-forgotten solution to this, Gearman ;)
>
>This would have no global dynamic state, and very little local dynamic
>state. API, conductor, and compute nodes simply need to know all of
>the choices users are offered, and there is no scheduler at runtime,
>just a predictive queue-list-manager that only gets updated when
>choices are added or removed. This would relieve a ton of the burden
>currently put on the database by scheduling, since the only accesses
>would be simple reads/writes (that includes 'server-list' type
>operations, since that would read a single index key).

I think we are very much thinking the same way here. I like the
general approach.

>Anyway, that's way off track, but I think this kind of thinking needs
>to happen and be taken seriously without turning into a bikeshed or
>fist fight.
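The worker model above (one queue per flavor, hosts subscribing only to
the queues they can currently satisfy, re-evaluating after every task)
could be toy-modeled like this; all names and flavor sizes are
hypothetical:

```python
# Toy sketch of the worker model: a queue per flavor, and each compute
# host subscribes to exactly the queues whose flavor fits its currently
# free resources, recomputing the subscription set after every task.
FLAVORS = {                      # flavor name -> resources it consumes
    "small": {"vcpus": 1, "ram_mb": 1024},
    "large": {"vcpus": 8, "ram_mb": 16384},
}

class ComputeHost:
    def __init__(self, vcpus, ram_mb):
        self.free = {"vcpus": vcpus, "ram_mb": ram_mb}

    def subscriptions(self):
        """Queues this host would subscribe to right now."""
        return {
            name for name, need in FLAVORS.items()
            if all(self.free[k] >= v for k, v in need.items())
        }

    def build(self, flavor):
        """Take a job off a queue, consume resources, re-evaluate."""
        for k, v in FLAVORS[flavor].items():
            self.free[k] -= v
        return self.subscriptions()   # the new subscription set

host = ComputeHost(vcpus=8, ram_mb=16384)
print(host.subscriptions())          # fits both flavors initially
print(host.build("large"))           # resources exhausted: no queues
```

Note there is no central scheduler anywhere in this sketch; placement
falls out of which hosts are listening on which queues, which is the
appeal of the design.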
>I don't think that will happen naturally until we start measuring
>where we are, and listening to operators as to where they'd like to
>be in relation to that.
>
>So in the interest of ending a long message with actions rather than
>words, let's get some measurement going. Rally? Something else? What
>can we do to measure this?

Performance tests against the 1000-node clusters being set up by OSIC?
Sounds like you have a playground for your tests.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
