areusch commented on a change in pull request #49:
URL: https://github.com/apache/tvm-rfcs/pull/49#discussion_r780359577
########## File path: rfcs/0049-managed-jenkins-infrastructure-for-tvm.md ##########
@@ -0,0 +1,136 @@
+# Managed Jenkins Infrastructure for TVM
+
+- Feature Name: `managed_jenkins_infra`
+- Start Date: 2022-01-03
+- RFC PR: [apache/tvm-rfcs#0049](https://github.com/apache/tvm-rfcs/pull/0049)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Pre-RFC: https://discuss.tvm.apache.org/t/pre-rfc-managed-jenkins-infrastructure-for-tvm/11692
+
+Authored-by: [Andrew Reusch](https://github.com/areusch) (@areusch)
+
+Authored-by: [Noah Kontur](https://github.com/konturn) (@konturn)
+
+See also: PoC of the Infrastructure-as-Code repos:
+- Ansible and Jenkins config: https://github.com/octoml/tvm-ci
+- Terraform: https://github.com/octoml/tvm-ci-terraform
+- Packer: https://github.com/octoml/tvm-ci-packer
+
+## Background and Motivations
+
+The Apache TVM project relies on Jenkins for Continuous Integration services. At present, Jenkins is maintained by a small set of folks, many of whom are core committers or serve on the PMC. As the project grows and the maintenance burden increases, we find that it would be beneficial to both the project and the current Jenkins maintainers to adopt a more modern, Infrastructure-as-Code approach to maintaining the fleet of machines and the web services responsible for TVM CI.
+
+### Architectural Overview
+
+At a high level, the proposed architecture is similar to what currently exists for TVM CI; namely, a leader VM in AWS will run the Jenkins GUI and assign pipeline jobs to agent VMs. As before, the Jenkins service on the leader VM will run via Docker, and the leader will assign jobs to the agents via SSH authentication.
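The leader-to-agent dispatch described above is driven by `label` parameters on `node` blocks in the `Jenkinsfile`. A minimal scripted-pipeline sketch follows; the stage names and build-script path are hypothetical, not copied from the real TVM `Jenkinsfile`, while `CPU` and `GPUBUILD` are labels TVM CI actually uses:

```groovy
// Sketch only: stage names and the script path are illustrative.
stage('Build') {
  parallel 'BUILD: CPU': {
    node('CPU') {                          // leader dispatches to an agent labeled CPU
      sh 'tests/scripts/task_build.sh'     // hypothetical build script
    }
  },
  'BUILD: GPU': {
    node('GPUBUILD') {                     // CUDA libraries present; the test GPU is not required to build
      sh 'tests/scripts/task_build.sh'
    }
  }
}
```

Jenkins matches each `node(label)` call against agents carrying that label and queues the job until such an agent is free, so the leader never needs to know individual agent hostnames.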
+While there will certainly be some architectural differences between this setup and the old one (agents will likely be deployed in autoscaling groups, and they will likely utilize a shared build cache via EFS or S3), the primary differences involve how provisioning and configuration are done:
+
+1. Packer will be used to provision baseline images for all of the agent and head node VMs. These images will be stored in AWS' AMI store and updated periodically when necessary.
+2. Terraform will be used to manage the infrastructural components of Jenkins CI, such as the head node, the agent autoscaling groups, and the load balancer handling SSL termination for the Jenkins leader VM. This way, infrastructure changes can be versioned and vetted in a publicly available repository.
+3. Ansible will be used to configure the Jenkins head node, and will thus handle items like Jenkins job configuration (e.g. how often nightly builds run) and authentication methods. As with Terraform, the Ansible code will be made publicly available.
+
+The Terraform and Ansible code will likely reside in different repositories, as they will likely utilize different deploy paradigms. The former will likely leverage [Atlantis pull request automation](https://www.runatlantis.io/), which essentially allows contributors to run and review Terraform plans by issuing comments on a PR. The Ansible playbooks used to configure Jenkins, on the other hand, will be run using GitHub Actions. If it is desirable to reduce complexity, we could use the same deploy tool for both.
+
+### Theory of Operation
+
+Under normal conditions, the system operates as follows:
+
+1. The Jenkins master node is configured with a Pipeline Multibranch project. The project source tree is set to the official Apache TVM GitHub repository.
+2. A GitHub [webhook](https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks) notifies the Jenkins master when any branch or PR is updated in the Apache TVM repository.
+3. The Jenkins master schedules a build for each notification it receives.
+4. When it is time to start the build (i.e. when the Jenkins [quiet period](https://www.jenkins.io/blog/2010/08/11/quiet-period-feature/) expires), Jenkins notifies GitHub and executes the `Jenkinsfile` to be used for the build.
+   - NOTE: for PR builds, the `Jenkinsfile` used is always the one checked in to the target merge branch (i.e. `main` for all practical purposes here). This is a convention of the [Multibranch Pipeline plugin](https://github.com/jenkinsci/workflow-multibranch-plugin).
+5. The TVM `Jenkinsfile` specifies a multi-stage build, each stage containing a set of parallel jobs which run on specific types of machines (machine types are identified by a `label` specified on [`node`](https://www.jenkins.io/doc/book/pipeline/syntax/#agent-parameters) lines in the `Jenkinsfile`). These machine labels are also present in the TVM Jenkins master configuration. Currently, TVM CI supports these labels with these meanings:
+   - `CPU` - an x86_64 machine with no specific GPU requirement which can execute the `ci-lint`, `ci-cpu`, `ci-wasm`, `ci-qemu`, and `ci-i386` containers
+   - `GPU` - an x86_64 machine with a specific GPU which can execute the `ci-gpu` container
+   - `GPUBUILD` - an x86_64 machine with CUDA and other GPU libraries present (such that `ci-gpu` can execute), but not necessarily with the GPU used in TVM CI unit tests. Used to build TVM and the unit tests which are then run on `GPU` nodes.
+   - `ARM` - an AArch64 machine which can run the `ci-arm` container
+   - `TensorCore` - an alias for `GPU` (historically, this specified a machine with a more powerful GPU)
+   - `doc` - a machine which serves the last-built docs from `main`
+6. Jenkins finds an **executor** machine for each job. Executors are machines running in AWS or other public clouds (e.g. Azure, GCP) which run the Jenkins agent. Jenkins dispatches the job to the executor and awaits the results.
+7. When a job in any stage fails, the build is aborted. Otherwise, the build proceeds through all stages.
+8. When the build is completed, Jenkins notifies GitHub of the result, and the status of the PR or `main` branch is updated.
+
+### Autoscaler
+
+Jenkins executor nodes can be classified into two groups:
+
+1. **Static nodes** are long-lived instances managed by Terraform. The Jenkins master is configured to connect to static nodes at startup and expects them to stay alive for the life of the Jenkins master process.
+2. **Autoscaled nodes** are cloud instances created by the Jenkins master in response to PR workload. As the build queue grows longer, Jenkins can choose to create additional executors to alleviate developer wait time. Autoscaled nodes persist for an adjustable period of time after they become idle.
+
+At launch time, we intend to use only static nodes. However, autoscaled nodes have been tested internally, and we will begin to use them sometime in Q1 2022. Autoscaled nodes present a debugging challenge, as flaky tests or non-reproducible errors will need to be diagnosed before the autoscaled node is automatically decommissioned by the Jenkins master.
+
+### Infrastructure-as-Code Repository
+
+The production TVM CI instance will be managed using an open-source Infrastructure-as-Code repository living in GitLab. GitLab is preferable for DevOps workflows due to a slightly nicer pipeline system, particularly one which allows for manual intervention when needed. All configuration except credentials will be stored in this repository. TVM committers, plus additional delegates of those committers responsible for running the TVM Jenkins infrastructure, will be granted write access to this repository.
+Any changes to this repository will require review from those individuals with write access who are actively involved in the day-to-day operations of TVM CI.

Review comment:
   oops, this was a decision we walked back after some consideration (e.g. better to stick with the same platform rather than have two). i missed this mention in my editing; fixed.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
