Thanks Jim. This makes a lot of sense and will hopefully make things simpler and more robust.
Just a few questions:

1. It looks like Zuul can request a specific set of nodes for a job. Do
   you envision the typical ansible playbook installing additional things
   required for the job, or would Zuul always need to request a suitable
   node for the job?

2. Would there be a way to share environment variables across multiple
   shell tasks? For example, would it be possible to reference a variable
   defined in the job yaml file from inside a shell script?

-Khai

On Thu, Feb 26, 2015 at 8:59 AM, James E. Blair <[email protected]> wrote:
> Hi,
>
> I've been wanting to make some structural changes to Zuul to round it
> out into a coherent system. I don't want to change it too much, but I'd
> also like a clean break with some of the baggage we've been carrying
> around from earlier decisions, and I want it to be able to continue to
> scale up (the config in particular is getting hard to manage with >500
> projects).
>
> I've batted a few ideas around with Monty, and I've written up my
> thoughts below. This is mostly a narrative exploration of what I think
> it should look like. It is not exhaustive, but I think it explores
> most of the major ideas. The next step is to turn this into a spec and
> start iterating on it and getting more detailed.
>
> I'm posting this here first for discussion to see if there are any
> major conceptual things that we should address before we get into more
> detailed spec review. Please let me know what you think.
>
> -Jim
>
> =======
> Goals
> =======
>
> Make Zuul scale to thousands of projects.
> Make Zuul more multi-tenant friendly.
> Make it easier to express complex scenarios in layout.
> Make nodepool more useful for non-virtual nodes.
> Make nodepool more efficient for multi-node tests.
> Remove the need for long-running slaves.
> Make it easier to use Zuul for continuous deployment.
>
> To accomplish this, changes to Zuul's configuration syntax are
> proposed, making it simpler to manage large numbers of jobs and
> projects, along with a new method of describing and running jobs, and
> a new system for node distribution with Nodepool.
>
> =====================
> Changes To Nodepool
> =====================
>
> Nodepool should be made to support explicit node requests and
> releases. That is to say, it should act more like its name -- a node
> pool.
>
> Rather than having servers add themselves to the pool by waiting for
> them (or Jenkins on their behalf) to register with gearman, nodepool
> should instead define functions to supply nodes on demand. For
> example, it might define the gearman functions "get-nodes" and
> "put-nodes". Zuul might request a node for a job by submitting a
> "get-nodes" job with the node type (eg "precise") as an argument. It
> could request two nodes together (in the same AZ) by supplying more
> than one node type in the same call. When complete, it could call
> "put-nodes" with the node identifiers to instruct nodepool to return
> them (nodepool might then delete, rebuild, etc).
>
> This model is much more efficient for multi-node tests, where we will
> no longer need special multinode labels. Instead, the multinode
> configuration can be much more ad-hoc and vary per job.
>
> The testenv broker used by tripleo behaves somewhat in this manner
> (though it only supports static sets of resources). It also has logic
> to deal with the situation where Zuul might exit unexpectedly and not
> return nodes (though it should strive to do so). This feature in the
> broker should be added to nodepool. Additionally, nodepool should
> support fully static resources (they should become just another node
> type) so that it can handle the use case of the test broker.
>
> =================
> Changes To Zuul
> =================
>
> Zuul is currently fundamentally a single-tenant application.
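[Editor's note: the get-nodes/put-nodes lease semantics proposed above can be sketched with a toy in-memory model. This is purely illustrative, not Nodepool code; the `NodePool` class, method names, and error handling are all hypothetical. Real Nodepool would expose these as gearman functions backed by cloud providers, and would need the crash-recovery logic the mail mentions.]

```python
import itertools
from collections import Counter


class NodePool:
    """Toy model of the proposed get-nodes/put-nodes lease protocol.

    A static dict of node counts stands in for the real pool of
    cloud (or fully static) resources.
    """

    def __init__(self, inventory):
        # inventory: {"precise": 2, "trusty": 1} -> nodes per label
        self._ids = itertools.count(1)
        self.available = {label: [next(self._ids) for _ in range(n)]
                          for label, n in inventory.items()}
        self.leased = {}  # node id -> label

    def get_nodes(self, labels):
        """Lease one node per requested label, all-or-nothing.

        A multi-node job asks for several labels in one call so the
        nodes can be allocated together (e.g. in the same AZ)."""
        need = Counter(labels)
        if any(len(self.available.get(label, [])) < n
               for label, n in need.items()):
            raise RuntimeError("insufficient capacity for %s" % (labels,))
        nodes = []
        for label in labels:
            node_id = self.available[label].pop()
            self.leased[node_id] = label
            nodes.append(node_id)
        return nodes

    def put_nodes(self, node_ids):
        """Return leased nodes; real nodepool would delete/rebuild them."""
        for node_id in node_ids:
            label = self.leased.pop(node_id)
            self.available[label].append(node_id)
```

With this model a multi-node devstack job would simply call `get_nodes(["trusty", "trusty"])` and `put_nodes(...)` when finished, with no special multinode label needed.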
> Some folks want to use it in a multi-tenant environment. Even within
> OpenStack, we have use for multitenancy. OpenStack might be one
> tenant, and each stackforge project might be another. Even if the big
> tent discussion renders that thinking obsolete, we may still want the
> kind of separation multi-tenancy can provide. The proposed
> implementation is flexible enough to run Zuul completely single-tenant
> with shared everything, completely multi-tenant with shared nothing,
> and everything in between. Being able to adjust just how much is
> shared or required, and how much can be left to individual projects,
> will be very useful.
>
> To support this, the main configuration should define tenants, and
> tenants should specify config files to include. These include files
> should define pipelines, jobs, and projects, all of which are
> namespaced to the tenant (so different tenants may have different jobs
> with the same names)::
>
>   ### main.yaml
>   - tenant:
>       name: openstack
>       include:
>         - global_config.yaml
>         - openstack.yaml
>
> Files may be included by more than one tenant, so common items can be
> placed in a common file and referenced globally. This means that for,
> eg, OpenStack, we can define pipelines and our base job definitions
> (with logging info, etc) once, and include them in all of our
> tenants::
>
>   ### main.yaml (continued)
>   - tenant:
>       name: openstack-infra
>       include:
>         - global_config.yaml
>         - infra.yaml
>
> A tenant may optionally specify repos from which it may derive its
> configuration. In this manner, a repo may keep its Zuul configuration
> within its own repo. This would only happen if the main configuration
> file specified that it is permitted::
>
>   ### main.yaml (continued)
>   - tenant:
>       name: random-stackforge-project
>       include:
>         - global_config.yaml
>       repos:
>         - stackforge/random  # Specific project config is in-repo
>
> Jobs defined in-repo may not have access to the full feature set
> (including some authorization features).
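[Editor's note: the per-tenant namespacing described above can be sketched as a small resolution step over already-parsed config. This is an assumption about how include resolution might work, not Zuul code; `resolve_tenants` and its data shapes are hypothetical.]

```python
def resolve_tenants(main_config, files):
    """Build per-tenant job tables from a main.yaml-style structure.

    main_config: list of {"tenant": {"name": ..., "include": [...]}}
    files: {filename: [config items]}, each item a dict such as
           {"job": {"name": ...}} (pipelines/projects elided here).
    Jobs are keyed by short name *within* each tenant, so two tenants
    may define different jobs with the same name without clashing.
    """
    tenants = {}
    for entry in main_config:
        tenant = entry["tenant"]
        jobs = {}
        for fname in tenant.get("include", []):
            for item in files[fname]:
                if "job" in item:
                    jobs[item["job"]["name"]] = item["job"]
        tenants[tenant["name"]] = {"jobs": jobs}
    return tenants
```

Because `global_config.yaml` is listed in every tenant's includes, each tenant sees the shared base jobs, while tenant-specific files stay invisible to other tenants.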
> They also may not override existing jobs.
>
> Job definitions continue to have the features in the current Zuul
> layout, but they also take on some of the responsibilities currently
> handled by the Jenkins (or other worker) definition::
>
>   ### global_config.yaml
>   # Every tenant in the system has access to these jobs (because their
>   # tenant definition includes it).
>   - job:
>       name: base
>       timeout: 30m
>       node: precise  # Just a variable for later use
>       nodes:  # The operative list of nodes
>         - name: controller
>           image: {node}  # Substitute the variable
>       auth:  # Auth may only be defined in central config, not in-repo
>         swift:
>           - container: logs
>       pre-run:  # These specify what to run before and after the job
>         - zuul-cloner
>       post-run:
>         - archive-logs
>
> Jobs have inheritance, and the above definition provides a base level
> of functionality for all jobs. It sets a default timeout, requests a
> single node (of type precise), and requests swift credentials to
> upload logs. Further jobs may extend and override these parameters::
>
>   ### global_config.yaml (continued)
>   # The python 2.7 unit test job
>   - job:
>       name: python27
>       parent: base
>       node: trusty
>
> Our use of job names specific to projects is a holdover from when we
> wanted long-lived slaves on Jenkins to efficiently re-use workspaces.
> This hasn't been necessary for a while, though we have used it to
> our advantage when collecting stats and reports. However, job
> configuration can be simplified greatly if we simply have a job that
> runs the python 2.7 unit tests and can be used for any project. To
> the degree that we want to know how often this job failed on nova, we
> can add that information back in when reporting statistics.
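[Editor's note: the inheritance described above, where `python27` gets `base`'s timeout but overrides its node, can be sketched as a parent-chain merge. The merge semantics here are a guess; a real implementation might merge some fields (e.g. append to pre-run/post-run lists) rather than replace them outright.]

```python
def resolve_job(name, jobs):
    """Flatten a job's parent chain into one effective definition.

    jobs: {name: definition-dict}; a definition may name a "parent".
    Values set on a child override the same key on its ancestors.
    """
    chain = []
    while name is not None:
        job = jobs[name]
        chain.append(job)
        name = job.get("parent")

    resolved = {}
    for job in reversed(chain):  # apply the base first, the leaf last
        resolved.update(job)
    resolved.pop("parent", None)  # bookkeeping key, not a job attribute
    return resolved
```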
> Jobs may have multiple aspects to accommodate differences among
> branches, etc.::
>
>   ### global_config.yaml (continued)
>   # Version that is run for changes on stable/icehouse
>   - job:
>       name: python27
>       parent: base
>       branch: stable/icehouse
>       node: precise
>
>   # Version that is run for changes on stable/juno
>   - job:
>       name: python27
>       parent: base
>       branch: stable/juno  # Could be combined into previous with regex
>       node: precise        # if concept of "best match" is defined
>
> Jobs may specify that they require more than one node::
>
>   ### global_config.yaml (continued)
>   - job:
>       name: devstack-multinode
>       parent: base
>       node: trusty  # could do same branch mapping as above
>       nodes:
>         - name: controller
>           image: {node}
>         - name: compute
>           image: {node}
>
> Jobs defined centrally (i.e., not in-repo) may specify auth info::
>
>   ### global_config.yaml (continued)
>   - job:
>       name: pypi-upload
>       parent: base
>       auth:
>         password:
>           pypi-password: pypi-password
>           # This looks up 'pypi-password' from an encrypted yaml file
>           # and adds it into variables for the job
>
> Pipeline definitions are similar to the current syntax, except that
> the new syntax supports specifying additional information for jobs in
> the context of a given project and pipeline. For instance, rather than
> specifying that a job is globally non-voting, you may specify that it
> is non-voting for a given project in a given pipeline::
>
>   ### openstack.yaml
>   - project:
>       name: openstack/nova
>       gate:
>         queue: integrated  # Shared queues are manually built
>         jobs:
>           - python27  # Runs version of job appropriate to branch
>           - devstack
>           - devstack-deprecated-feature:
>               branch: stable/juno  # Only run on stable/juno changes
>               voting: false        # Non-voting
>       post:
>         jobs:
>           - tarball:
>               jobs:
>                 - pypi-upload
>
> Currently, unique job names are used to build shared change queues.
> Since job names will no longer be unique, shared queues must be
> manually constructed by assigning them a name.
> Projects with the same queue name for the same pipeline will have a
> shared queue.
>
> A subset of functionality is available to projects that are permitted
> to use in-repo configuration::
>
>   ### stackforge/random/.zuul.yaml
>   - job:
>       name: random-job
>       parent: base  # From global config; gets us logs
>       node: precise
>
>   - project:
>       name: stackforge/random
>       gate:
>         jobs:
>           - python27    # From global config
>           - random-job  # From local config
>
> The executable content of jobs should be defined as ansible playbooks.
> Playbooks can be fairly simple and might consist of little more than
> "run this shell script" for those who are not otherwise interested in
> ansible::
>
>   ### stackforge/random/playbooks/random-job.yaml
>   ---
>   hosts: controller
>   tasks:
>     - shell: run_some_tests.sh
>
> Global jobs may define ansible roles for common functions::
>
>   ### openstack-infra/zuul-playbooks/python27.yaml
>   ---
>   hosts: controller
>   roles:
>     - tox:
>         env: py27
>
> Because ansible has well-articulated multi-node orchestration
> features, this permits very expressive job definitions for multi-node
> tests. A playbook can specify different roles to apply to the
> different nodes that the job requested::
>
>   ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
>   ---
>   hosts: controller
>   roles:
>     - devstack
>   ---
>   hosts: compute
>   roles:
>     - devstack-compute
>
> Additionally, if a project is already defining ansible roles for its
> deployment, then those roles may be easily applied in testing, making
> CI even closer to CD. Finally, to make Zuul more useful for CD, Zuul
> may be configured to run a job (i.e., an ansible role) on a specific
> node.
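[Editor's note: the manually named shared queues described earlier reduce to a simple grouping step at configuration-load time. This sketch is hypothetical; names and data shapes are illustrative, and it assumes a project without an explicit queue name gets a private queue of its own.]

```python
def build_shared_queues(projects, pipeline):
    """Group projects into change queues by declared queue name.

    projects: list of project dicts such as
      {"name": "openstack/nova", "gate": {"queue": "integrated"}}
    Projects naming the same queue in the same pipeline share one
    queue, replacing the old job-name-overlap heuristic.
    """
    queues = {}
    for project in projects:
        pipeline_cfg = project.get(pipeline, {})
        # No explicit queue name -> a private queue named after the project.
        qname = pipeline_cfg.get("queue", project["name"])
        queues.setdefault(qname, []).append(project["name"])
    return queues
```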
>
> The pre- and post-run entries in the job definition might also apply
> to ansible playbooks and can be used to simplify job setup and
> cleanup::
>
>   ### openstack-infra/zuul-playbooks/zuul-cloner.yaml
>   ---
>   hosts: all
>   roles:
>     - zuul-cloner: {{zuul}}
>
> Where the zuul variable is a dictionary containing all the information
> currently transmitted in the ZUUL_* environment variables. Similarly,
> the log archiving script can copy logs from the host to swift.
>
> A new Zuul component would be created to execute jobs. Rather than
> running a worker process on each node (which requires installing
> software on the test node, establishing and maintaining network
> connectivity back to Zuul, and coordinating actions across nodes for
> multi-node tests), this new component will accept jobs from Zuul and,
> for each one, write an ansible inventory file with the node and
> variable information, then execute the ansible playbook for that job.
> This means that the new Zuul component will maintain ssh connections
> to all hosts currently running a job. This could become a bottleneck,
> but ansible and ssh have been known to scale to a large number of
> simultaneous hosts, and this component may be scaled horizontally. It
> should be simple enough that it could even be automatically scaled if
> needed. In turn, however, this does make node configuration simpler
> (test nodes need only have an ssh public key installed) and makes
> tests behave more like deployment.
>
> _______________________________________________
> OpenStack-Infra mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
