areusch commented on code in PR #11403:
URL: https://github.com/apache/tvm/pull/11403#discussion_r883067342


##########
jenkins/README.md:
##########
@@ -26,3 +137,90 @@ pip install -r jenkins/requirements.txt
 python jenkins/generate.py

Review Comment:
   while we're here, maybe we should switch these instructions to use a venv:
   ```
   python3 -m venv _venv
   _venv/bin/pip3 install -r jenkins/requirements.txt
   _venv/bin/python3 jenkins/generate.py
   ```
   
   we could also consider adding this as a target in the `Makefile`
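
If wrapping these steps in a script is easier than a `Makefile` target, a rough Python sketch might look like the following. The helper names `venv_commands` and `regenerate` are hypothetical (nothing like them exists in the PR), and the `bin/` layout assumes a POSIX system run from the repository root; on Windows the venv puts its executables under `Scripts/` instead.

```python
import subprocess


def venv_commands(venv_dir="_venv"):
    """Return the three commands from the comment above as argv lists.

    Hypothetical helper for illustration; assumes a POSIX venv layout.
    """
    return [
        # Create the virtual environment.
        ["python3", "-m", "venv", venv_dir],
        # Install the generator's dependencies into the venv only.
        [f"{venv_dir}/bin/pip3", "install", "-r", "jenkins/requirements.txt"],
        # Run the generator with the venv's interpreter.
        [f"{venv_dir}/bin/python3", "jenkins/generate.py"],
    ]


def regenerate(venv_dir="_venv"):
    """Run each command in order, stopping at the first failure."""
    for cmd in venv_commands(venv_dir):
        subprocess.run(cmd, check=True)
```

Calling `regenerate()` from the repository root would run the same three commands as the shell snippet. Keeping the command list separate from the runner makes it easy to print or inspect the commands without executing anything.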



##########
jenkins/README.md:
##########
@@ -15,8 +15,119 @@
 <!--- specific language governing permissions and limitations -->
 <!--- under the License. -->
 
+# TVM CI
+
+TVM runs CI jobs on every commit to an open pull request and to branches in 
the apache/tvm repo (such as `main`). These jobs are essential to keeping the 
TVM project in a healthy state and preventing breakages. Jenkins does most of 
the work in running the TVM tests, though some smaller jobs are also run on 
GitHub Actions.
+
+## GitHub Actions
+
+GitHub Actions is used to run Windows jobs, MacOS jobs, and various on-GitHub 
automations. These are defined in [`.github/workflows`](../.github/workflows/). 
These automations include bots to:
+* [cc people based on subscribed 
teams/topics](https://github.com/apache/tvm/issues/10317)
+* [allow non-committers to merge approved / CI passing 
PRs](https://discuss.tvm.apache.org/t/rfc-allow-merging-via-pr-comments/12220)
+* [add cc-ed people as reviewers on 
GitHub](https://discuss.tvm.apache.org/t/rfc-remove-codeowners/12095)
+* [ping languishing PRs after no activity for a week (currently opt-in 
only)](https://github.com/apache/tvm/issues/9983)
+* [push a `last-successful` branch to GitHub with the last `main` commit that 
passed CI](https://github.com/apache/tvm/tree/last-successful)
+
+https://github.com/apache/tvm/actions has the logs for each of these 
workflows. Note that when debugging these workflows changes from PRs from 
forked repositories won't be relfected in the PR. These should be tested in the 
forked repository first and linked in the PR body.
+
+
+## Keeping CI Green
+
+Developers rely on the TVM CI to get signal on their PRs before merging.
+Occasionally breakages slip through and break `main`, which in turn causes
+the same error to show up on an PR that is based on the broken commit(s). 
Broken
+commits can be identified [through 
GitHub](https://github.com/apache/tvm/commits/main>)
+via the commit status icon or via 
[Jenkins](https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/activity?branch=main>).
+In these situations it is possible to either revert the offending commit or
+submit a forward fix to address the issue. It is up to the committer and commit
+author which option to choose, keeping in mind that a broken CI affects all TVM
+developers and should be fixed as soon as possible.
+
+Some tests are also flaky and fail for reasons unrelated to the PR. The [CI 
monitoring rotation](https://github.com/apache/tvm/wiki/CI-Monitoring-Runbook) 
watches for these failures and disables tests as necessary. It is the 
responsibility of those who wrote the test to ultimately fix and re-enable the 
test.
+
+
+## Dealing with Flakiness
+
+If you notice a failure on your PR that seems unrelated to your change, you 
should
+search [recent GitHub issues related to flaky 
tests](https://github.com/apache/tvm/issues?q=is%3Aissue+%5BCI+Problem%5D+Flaky+>)
 and
+[file a new 
issue](https://github.com/apache/tvm/issues/new?assignees=&labels=&template=ci-problem.md&title=%5BCI+Problem%5D+>)
+if you don't see any reports of the failure. If a certain test or class of 
tests affects
+several PRs or commits on `main` with flaky failures, the test should be 
disabled via
+[pytest's @xfail 
decorator](https://docs.pytest.org/en/6.2.x/skipping.html#xfail-mark-test-functions-as-expected-to-fail)
 with 
[`strict=True`](https://docs.pytest.org/en/6.2.x/skipping.html#strict-parameter)
 and the relevant issue linked in the

Review Comment:
   should this be `strict=True` or `strict=False`?
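
For context on the question: per pytest's documentation, `strict` only changes what happens when an xfail-marked test unexpectedly passes. The sketch below models that reporting in plain Python so the two settings can be compared side by side. `xfail_outcome` and `run_is_green` are hypothetical helpers, not pytest APIs, and the model assumes a bare `@pytest.mark.xfail(strict=...)` with no `raises`, `run`, or condition arguments.

```python
def xfail_outcome(test_passed: bool, strict: bool) -> str:
    """Model how pytest reports a test marked @pytest.mark.xfail(strict=...).

    Hypothetical helper for illustration; it does not call pytest.
    """
    if not test_passed:
        # The expected failure happened: reported as XFAIL, run stays green.
        return "xfailed"
    # The test passed unexpectedly (XPASS): strict mode turns this into a failure.
    return "failed" if strict else "xpassed"


def run_is_green(outcome: str) -> bool:
    """Only a 'failed' outcome fails the overall test run."""
    return outcome != "failed"


# A flaky test passes on some runs and fails on others, so check both cases.
summary = {
    strict: [run_is_green(xfail_outcome(passed, strict)) for passed in (True, False)]
    for strict in (True, False)
}
```

Under this model, a flaky test marked with `strict=True` still turns the run red whenever it happens to pass, while `strict=False` keeps the run green in both cases. That suggests `strict=False` is the setting that actually silences a flaky test, though the PR author may have had a different intent.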



##########
jenkins/README.md:
##########
@@ -15,8 +15,119 @@
 <!--- specific language governing permissions and limitations -->
 <!--- under the License. -->
 
+# TVM CI
+
+TVM runs CI jobs on every commit to an open pull request and to branches in 
the apache/tvm repo (such as `main`). These jobs are essential to keeping the 
TVM project in a healthy state and preventing breakages. Jenkins does most of 
the work in running the TVM tests, though some smaller jobs are also run on 
GitHub Actions.
+
+## GitHub Actions
+
+GitHub Actions is used to run Windows jobs, MacOS jobs, and various on-GitHub 
automations. These are defined in [`.github/workflows`](../.github/workflows/). 
These automations include bots to:
+* [cc people based on subscribed 
teams/topics](https://github.com/apache/tvm/issues/10317)
+* [allow non-committers to merge approved / CI passing 
PRs](https://discuss.tvm.apache.org/t/rfc-allow-merging-via-pr-comments/12220)
+* [add cc-ed people as reviewers on 
GitHub](https://discuss.tvm.apache.org/t/rfc-remove-codeowners/12095)
+* [ping languishing PRs after no activity for a week (currently opt-in 
only)](https://github.com/apache/tvm/issues/9983)
+* [push a `last-successful` branch to GitHub with the last `main` commit that 
passed CI](https://github.com/apache/tvm/tree/last-successful)
+
+https://github.com/apache/tvm/actions has the logs for each of these 
workflows. Note that when debugging these workflows changes from PRs from 
forked repositories won't be relfected in the PR. These should be tested in the 
forked repository first and linked in the PR body.

Review Comment:
   ```suggestion
   https://github.com/apache/tvm/actions has the logs for each of these 
workflows. Note that when debugging these workflows changes from PRs from 
forked repositories won't be reflected in the PR. These should be tested in the 
forked repository first and linked in the PR body.
   ```



##########
jenkins/README.md:
##########
@@ -26,3 +137,90 @@ pip install -r jenkins/requirements.txt
 python jenkins/generate.py
 ```
 
+# Infrastructure
+
+Jenkins runs in AWS on an EC2 instance fronted by an ELB which makes it 
available at https://ci.tlcpack.ai. These definitions are declared via 
Terraform in the 
[tlc-pack/ci-terraform](https://github.com/tlc-pack/ci-terraform) repository. 
The Terraform code references custom AMIs built in 
[tlc-pack/ci-packer](https://github.com/tlc-pack/ci-packer). 
[tlc-pack/ci](https://github.com/tlc-pack/ci) contains Ansible scripts to 
deploy the Jenkins head node and set it up to interact with AWS.
+
+The Jenkins head node has a number of autoscaling groups with labels that are 
used to run jobs (e.g. `CPU`, `GPU` or `ARM`) via the [EC2 
Fleet](https://plugins.jenkins.io/ec2-fleet/) plugin.
+
+## Deploying
+
+Deploying Jenkins can disrupt developers so it must be done with care. Jobs 
that are in-flight will be cancelled and must be manually restarted. Follow the 
instructions [here](https://github.com/tlc-pack/ci/issues/10) to run a deploy.
+
+## Monitoring
+
+Dashboards of CI data can be found:
+* within Jenkins at https://ci.tlcpack.ai/monitoring (HTTP / JVM stats)
+* at https://monitoring.tlcpack.ai (job status, worker status)
+
+## CI Diagram
+
+This details the individual parts that interact in TVM's CI.

Review Comment:
   should we link to further CI ops docs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

