Hi, I've been wanting to make some structural changes to Zuul to round it out into a coherent system. I don't want to change it too much, but I'd also like a clean break with some of the baggage we've been carrying around from earlier decisions, and I want it to be able to continue to scale up (the config in particular is getting hard to manage with >500 projects).
I've batted a few ideas around with Monty, and I've written up my thoughts below. This is mostly a narrative exploration of what I think it should look like. It is not exhaustive, but I think it explores most of the major ideas. The next step is to turn this into a spec and start iterating on it in more detail. I'm posting this here first for discussion, to see whether there are any major conceptual things we should address before we get into more detailed spec review. Please let me know what you think.

-Jim

=======
Goals
=======

* Make Zuul scale to thousands of projects.
* Make Zuul more multi-tenant friendly.
* Make it easier to express complex scenarios in the layout.
* Make nodepool more useful for non-virtual nodes.
* Make nodepool more efficient for multi-node tests.
* Remove the need for long-running slaves.
* Make it easier to use Zuul for continuous deployment.

To accomplish this, changes to Zuul's configuration syntax are proposed to make it simpler to manage large numbers of jobs and projects, along with a new method of describing and running jobs, and a new system for node distribution with nodepool.

=====================
Changes To Nodepool
=====================

Nodepool should be made to support explicit node requests and releases. That is to say, it should act more like its name -- a node pool.

Rather than having servers add themselves to the pool by waiting for them (or Jenkins on their behalf) to register with gearman, nodepool should instead define functions that supply nodes on demand. For example, it might define the gearman functions "get-nodes" and "put-nodes". Zuul might request a node for a job by submitting a "get-nodes" job with the node type (e.g., "precise") as an argument. It could request two nodes together (in the same AZ) by supplying more than one node type in the same call. When the job is complete, it could call "put-nodes" with the node identifiers to instruct nodepool to return them (nodepool might then delete, rebuild, etc.).
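To make the request/release cycle concrete, the payloads might look something like the following (shown as YAML for readability; the field names "node_types", "same_az", and "nodes" are illustrative assumptions, not a settled interface)::

  # Hypothetical "get-nodes" payload: two trusty nodes in the same AZ.
  node_types:
    - trusty
    - trusty
  same_az: true

  # Hypothetical "put-nodes" payload: return the nodes by identifier
  # so that nodepool can delete or rebuild them.
  nodes:
    - id: 1234
    - id: 1235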
This model is much more efficient for multi-node tests, where we will no longer need special multinode labels. Instead, the multinode configuration can be much more ad hoc and vary per job.

The testenv broker used by TripleO behaves somewhat in this manner (though it only supports static sets of resources). It also has logic to deal with the situation where Zuul might exit unexpectedly and fail to return nodes (though Zuul should strive to return them). That feature of the broker should be added to nodepool. Additionally, nodepool should support fully static resources (they should become just another node type) so that it can handle the use case of the testenv broker.

=================
Changes To Zuul
=================

Zuul is currently fundamentally a single-tenant application. Some folks want to use it in a multi-tenant environment. Even within OpenStack, we have a use for multitenancy: OpenStack might be one tenant, and each StackForge project might be another. Even if the big-tent discussion renders that thinking obsolete, we may still want the kind of separation that multi-tenancy can provide.

The proposed implementation is flexible enough to run Zuul completely single-tenant with everything shared, completely multi-tenant with nothing shared, and everything in between. Being able to adjust just how much is shared or required, and how much can be left to individual projects, will be very useful.

To support this, the main configuration should define tenants, and tenants should specify config files to include. These include files should define pipelines, jobs, and projects, all of which are namespaced to the tenant (so different tenants may have different jobs with the same names)::

  ### main.yaml
  - tenant:
      name: openstack
      include:
        - global_config.yaml
        - openstack.yaml

Files may be included by more than one tenant, so common items can be placed in a common file and referenced globally. This means that for, e.g., OpenStack, we can define pipelines and our base job definitions (with logging info, etc.) once, and include them in all of our tenants::

  ### main.yaml (continued)
  - tenant:
      name: openstack-infra
      include:
        - global_config.yaml
        - infra.yaml

A tenant may optionally specify repos from which it may derive its configuration. In this manner, a repo may keep its Zuul configuration within itself. This would only happen if the main configuration file specified that it is permitted::

  ### main.yaml (continued)
  - tenant:
      name: random-stackforge-project
      include:
        - global_config.yaml
      repos:
        - stackforge/random  # Specific project config is in-repo

Jobs defined in-repo may not have access to the full feature set (including some authorization features). They also may not override existing jobs.

Job definitions continue to have the features of the current Zuul layout, but they also take on some of the responsibilities currently handled by the Jenkins (or other worker) definition::

  ### global_config.yaml
  # Every tenant in the system has access to these jobs (because their
  # tenant definition includes this file).
  - job:
      name: base
      timeout: 30m
      node: precise  # Just a variable for later use
      nodes:  # The operative list of nodes
        - name: controller
          image: {node}  # Substitute the variable
      auth:  # Auth may only be defined in central config, not in-repo
        swift:
          - container: logs
      pre-run:  # These specify what to run before and after the job
        - zuul-cloner
      post-run:
        - archive-logs

Jobs have inheritance, and the above definition provides a base level of functionality for all jobs. It sets a default timeout, requests a single node (of type precise), and requests swift credentials to upload logs. Further jobs may extend and override these parameters::

  ### global_config.yaml (continued)
  # The python 2.7 unit test job
  - job:
      name: python27
      parent: base
      node: trusty
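To illustrate how inheritance composes, here is a sketch of the effective python27 job after resolution; this expansion (including the point at which {node} is substituted) is an assumption about the semantics that the spec would need to pin down::

  # Hypothetical resolved form of "python27" (assumed semantics):
  # everything comes from base except the overridden node variable.
  - job:
      name: python27
      timeout: 30m       # inherited from base
      node: trusty       # overridden by python27
      nodes:
        - name: controller
          image: trusty  # {node} substituted after the override
      auth:
        swift:
          - container: logs
      pre-run:
        - zuul-cloner
      post-run:
        - archive-logs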
Our use of job names specific to projects is a holdover from when we wanted long-lived slaves on Jenkins to efficiently re-use workspaces. This hasn't been necessary for a while, though we have used it to our advantage when collecting stats and reports. However, job configuration can be simplified greatly if we simply have one job that runs the python 2.7 unit tests and can be used for any project. To the degree that we want to know how often that job failed on nova, we can add that information back in when reporting statistics.

Jobs may have multiple aspects to accommodate differences among branches, etc.::

  ### global_config.yaml (continued)
  # Version that is run for changes on stable/icehouse
  - job:
      name: python27
      parent: base
      branch: stable/icehouse
      node: precise

  # Version that is run for changes on stable/juno
  - job:
      name: python27
      parent: base
      branch: stable/juno  # Could be combined into previous with regex
      node: precise        # if a concept of "best match" is defined

Jobs may specify that they require more than one node::

  ### global_config.yaml (continued)
  - job:
      name: devstack-multinode
      parent: base
      node: trusty  # Could do the same branch mapping as above
      nodes:
        - name: controller
          image: {node}
        - name: compute
          image: {node}

Jobs defined centrally (i.e., not in-repo) may specify auth info::

  ### global_config.yaml (continued)
  - job:
      name: pypi-upload
      parent: base
      auth:
        password:
          pypi-password: pypi-password
          # This looks up 'pypi-password' from an encrypted yaml file
          # and adds it to the variables for the job

Pipeline definitions are similar to the current syntax, except that they support specifying additional information for jobs in the context of a given project and pipeline. For instance, rather than specifying that a job is globally non-voting, you may specify that it is non-voting for a given project in a given pipeline::

  ### openstack.yaml
  - project:
      name: openstack/nova
      gate:
        queue: integrated  # Shared queues are manually built
        jobs:
          - python27  # Runs the version of the job appropriate to the branch
          - devstack
          - devstack-deprecated-feature:
              branch: stable/juno  # Only run on stable/juno changes
              voting: false  # Non-voting
      post:
        jobs:
          - tarball:
              jobs:
                - pypi-upload

Currently, unique job names are used to build shared change queues. Since job names will no longer be unique, shared queues must be manually constructed by assigning them a name. Projects with the same queue name in the same pipeline will have a shared queue.

A subset of functionality is available to projects that are permitted to use in-repo configuration::

  ### stackforge/random/.zuul.yaml
  - job:
      name: random-job
      parent: base   # From global config; gets us logs
      node: precise

  - project:
      name: stackforge/random
      gate:
        jobs:
          - python27    # From global config
          - random-job  # From local config

The executable content of jobs should be defined as ansible playbooks. Playbooks can be fairly simple, and for those who are not otherwise interested in ansible they might consist of little more than "run this shell script"::

  ### stackforge/random/playbooks/random-job.yaml
  ---
  - hosts: controller
    tasks:
      - shell: run_some_tests.sh

Global jobs may define ansible roles for common functions::

  ### openstack-infra/zuul-playbooks/python27.yaml
  ---
  - hosts: controller
    roles:
      - { role: tox, env: py27 }

Because ansible has well-articulated multi-node orchestration features, this permits very expressive job definitions for multi-node tests. A playbook can specify different roles to apply to the different nodes that the job requested::

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
  ---
  - hosts: controller
    roles:
      - devstack
  - hosts: compute
    roles:
      - devstack-compute

Additionally, if a project is already defining ansible roles for its deployment, those roles may easily be applied in testing, making CI even closer to CD.

Finally, to make Zuul more useful for CD, Zuul may be configured to run a job (i.e., an ansible role) on a specific node.
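For example, a hypothetical deployment job might be pinned to a specific static node along these lines; the syntax for naming a concrete host, and the job and host names themselves, are assumptions that the spec would need to settle::

  ### global_config.yaml (continued)
  # Hypothetical continuous-deployment job pinned to a static node.
  - job:
      name: deploy-wiki
      parent: base
      nodes:
        - name: wiki.example.org  # A fully static nodepool resource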
The pre- and post-run entries in the job definition might also refer to ansible playbooks, and can be used to simplify job setup and cleanup::

  ### openstack-infra/zuul-playbooks/zuul-cloner.yaml
  ---
  - hosts: all
    roles:
      - { role: zuul-cloner, zuul: "{{ zuul }}" }

where the zuul variable is a dictionary containing all of the information currently transmitted in the ZUUL_* environment variables. Similarly, the log archiving script can copy logs from the host to swift.

A new Zuul component would be created to execute jobs. Rather than running a worker process on each node (which requires installing software on the test node, establishing and maintaining network connectivity back to Zuul, and being able to coordinate actions across nodes for multi-node tests), this new component will accept jobs from Zuul and, for each one, write an ansible inventory file with the node and variable information, then execute the ansible playbook for that job.

This means that the new Zuul component will maintain ssh connections to all hosts currently running a job. This could become a bottleneck, but ansible and ssh have been known to scale to a large number of simultaneous hosts, and this component may be scaled horizontally. It should be simple enough that it could even be scaled automatically if needed. In exchange, this makes node configuration simpler (test nodes need only have an ssh public key installed) and makes tests behave more like deployment.
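To make that concrete, the inventory written for one run of the devstack-multinode job might look roughly like the following sketch (in YAML inventory form); the hostnames and the shape of the zuul variable are illustrative assumptions::

  # Hypothetical inventory for one devstack-multinode run, grouping
  # the requested nodes by the role names from the job definition.
  all:
    vars:
      zuul:
        project: openstack/nova
        branch: master
    children:
      controller:
        hosts:
          node-1234.example.org:
      compute:
        hosts:
          node-1235.example.org: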