I've spent the last few days digesting the results of Luke's queueing experiments (see e-mails "Asynchronous catalog compiles", "Asynchrony, take 2", and "Asynchrony, take 3"), and reviewing them with Luke, Jesse W, Jacob H, and Teyo T in an effort to figure out a good way to move forward from experiment to concrete implementation.
To start things off, I'd like to try to list the most salient customer-visible features that have been motivating our foray into asynchrony and queueing. Once we've done that, it should be easier to choose a subset of functionality to target for 2.7 that is valuable to a lot of people without requiring years to implement. I suspect we may discover that some of the features that motivated the investigation into queueing and asynchrony could be built today, without needing an architectural change, and if so that might be a big win in terms of implementation effort.

Here are the salient features that I've culled from the e-mail discussions and meetings with Luke and others, in no particular order. Please feel free to make comments and corrections, talk about what's most important to you personally, and especially to add your own items if you think I'm missing something important.

Features:

1. Make low-end scaling easier: currently, when a customer's deployment gets too large to be handled by a single master running Webrick, they have to install Apache/Passenger or Mongrel. This can be difficult to do, especially since Passenger has limited package support on some OSes (notably RHEL/CentOS 5). It would be nice to give people a less painful way of scaling up beyond what Webrick is capable of handling.

2. Make medium-to-high-end scaling easier: currently, when a customer's deployment gets too large to be handled by a single physical machine, they have to set up a load-balancing infrastructure to distribute HTTPS requests from client machines to a suite of puppet masters. It would be nice to give people a way of adding CPUs to the problem (essentially creating a "catalog compiler farm") without forcing them to add a layer of infrastructure.

3. Allow customers who already have a queueing system as part of their infrastructure to use it to scale Puppet, so they don't have to implement a special Puppet-specific piece of infrastructure.

4. Make Puppet handle load spikes more robustly. Currently I understand from Luke that there is an avalanche effect once the master reaches 100% capacity, wherein the machine starts thrashing and actually loses throughput, causing further load increases. It would be nice if we could guarantee that Puppet didn't try to serve more simultaneous requests than its processors/memory could handle, so that performance would degrade more gracefully in times of high load.

5. Allow customers to prioritize compilations for some client machines over others, so that mission-critical updates aren't delayed.

6. Allow a "push model", where changing the manifest causes catalogs to be recompiled, and those catalogs are then pushed out to client machines rather than waiting for the client machines to contact the master and request catalogs.

7. Allow inter-machine dependencies to be updated faster (i.e. if machine A's configuration depends on the stored configuration of machine B, then when a new catalog gets sent to machine B, push an update to machine A ASAP rather than waiting for it to contact the master and request a catalog).

8. Allow the fundamental building blocks of Puppet to be decomposed more easily by advanced customers so that they can build in their own functionality, especially with respect to caching, request routing, and reporting. For example, a customer might decide that instead of building a brand new catalog in response to every catalog request, they might want to send a standard pre-built catalog to some clients.
Customers should be able to do things like this (and make other extensions to Puppet that we cannot anticipate) by putting together the building blocks of Puppet in their own unique ways (see the rough sketch after the caveats below).

9. Allow for staged rollouts--a customer may want to update a manifest on the master but have the change propagate to client machines in a controlled fashion over several days, rather than automatically deploying the change to each machine whenever it happens to contact the puppet master next.

10. Allow for faster file serving by letting a client machine request multiple files in parallel rather than making a separate REST request for every single file.

11. Allow for fail-over: if one puppet master crashes, allow other puppet masters to transparently take over the work it was doing.

Note that these features come with a number of caveats:

A. We don't want to take a big performance hit in order to add these features, especially since many of the features concern scalability and hence are performance-critical.

B. We don't want to break existing features or introduce new dependencies (e.g. customers whose deployment is small enough that they don't have major scalability problems should be able to continue using HTTPS/Webrick).

C. We don't want to unnecessarily duplicate effort in the code base (e.g. we wouldn't want to write a complete queueing infrastructure, independent of the indirector, that served much the same purpose).

D. We don't want to break compatibility with older (2.6 and possibly 0.25) clients.

E. We don't want to sacrifice error handling, and we want the system to be at least as robust as 0.25 and 2.6 if a puppet master crashes.

F. We don't want to lose (or waste a lot of time re-implementing) the features that we get "for free" from HTTPS/REST, such as synchronous file delivery, the ability to tunnel/proxy through firewalls, and the security of SSL.
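To make item 8 a little more concrete, here is a minimal, hypothetical sketch of the kind of building block a customer might assemble today: a custom indirector terminus that answers some catalog requests from pre-built catalogs on disk instead of compiling. The class name (Prebuilt), file path, and the /var/lib/puppet/prebuilt directory layout are all made up for illustration; I'm not proposing this as the design, just pointing at the sort of recombination we'd be enabling.

# lib/puppet/indirector/catalog/prebuilt.rb -- hypothetical example only.
# A catalog terminus that serves a pre-built catalog from disk when one
# exists for the requesting node, instead of compiling a new one.
require 'yaml'
require 'puppet/resource/catalog'
require 'puppet/indirector/code'

class Puppet::Resource::Catalog::Prebuilt < Puppet::Indirector::Code
  desc "Serve pre-built catalogs from disk where available."

  def find(request)
    # request.key is the name of the node asking for a catalog.
    path = File.join("/var/lib/puppet/prebuilt", "#{request.key}.yaml")
    return nil unless File.exist?(path)  # nil tells the indirector "not found"
    YAML.load_file(path)                 # assumed to hold a serialized Puppet::Resource::Catalog
  end
end

Something along these lines could presumably be hooked in via the catalog_terminus setting. The interesting question for this discussion is how much of that kind of recombination we want to make a supported, documented part of Puppet rather than something only people who read the source can do.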
