Hi Ceph,

TL;DR: a ceph-qa-suite bot running on pull requests is sustainable and is an 
incentive for contributors to use teuthology-openstack independently

When a pull request is submitted, it is compiled, some tests are run[1] and the 
result is added to the pull request to confirm that it does not introduce a 
trivial problem. Such tests are however limited because they must:

* run within a few minutes at most
* not require multiple machines
* not require root privileges

More extensive tests (primarily integration tests) are needed before a 
contribution can be merged into Ceph [2], to verify it does not introduce a 
subtle regression. It would be ideal to run these integration tests on each 
pull request but there are two obstacles:

* each test takes ~1.5 hours
* each test costs ~0.30 euros

On the current master, running all tests would require ~1000 jobs [3]. That 
would cost ~300 euros on each pull request and take ~15 hours (ten batches of 
~1.5 hours) assuming 100 jobs can run in parallel. We could resolve that 
problem by:

* maintaining a ceph-qa-suite map to be used as a whitelist mapping a diff to 
a set of tests. For instance, if the diff modifies the src/ceph-disk file, it 
outputs the ceph-disk suite[4]. This would trim the tests that are unrelated 
to the contribution and reduce the number of tests to a maximum of ~100 [4] 
and most likely a dozen.
* running the tests only if one of the commits of the pull request has the 
*Needs-qa: true* flag in the commit message[5]
* limiting the number of tests to fit in the allocated budget. If there was 
enough funding for 10,000 jobs during the previous period and a total of 1,000 
test runs were required (a test run is a set of tests as produced by the 
ceph-qa-suite map), each run is trimmed to a maximum of ten tests (10,000 / 
1,000), regardless of how many tests the map selected.

Here is an example:

* Joe submits a pull request to fix a bug in the librados API
* The make check bot compiles and fails make check because it introduces a bug
* Joe uses run-make-check.sh locally to repeat the failure, fixes it and 
repushes
* The make check bot compiles and passes make check
* Joe amends the commit message to add *Needs-qa: true* and repushes
* The ceph-qa-suite map script finds a change on the librados API and outputs 
smoke/basic/tasks/rados_api_tests.yaml
* The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml 
which fails
* Joe examines the logs found at http://teuthology-logs.public.ceph.com/ and 
decides to debug by running the test himself
* Joe runs teuthology-openstack --suite smoke/basic/tasks/rados_api_tests.yaml 
against his own OpenStack tenant [6]
* Joe repushes with a fix
* The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml 
which succeeds
* Kefu reviews the pull request and has a link to the successful test runs in 
the comments

This approach scales with the size of the Ceph developer community [7] because 
regular contributors benefit directly from funding the ceph-qa-suite bot. New 
contributors can focus on learning how to interpret the ceph-qa-suite error 
logs for their contribution and how to debug it via teuthology-openstack if 
needed, which is a better user experience than trying to figure out which 
ceph-qa-suite job to run, learning about teuthology, scheduling the test and 
interpreting the results.

The maintenance workload of a ceph-qa-suite bot probably requires one work day 
a week, to handle funding and system administration of the server where the bot 
runs, but mostly to sort out the false negatives. I believe a pure self-service 
approach where each contributor would be asked to run teuthology-openstack 
independently would actually require more work. The ceph-qa-suite bot provides 
a baseline on which everybody can agree to sort out the false negatives. When 
contributors run teuthology-openstack by themselves, it is difficult for them 
to figure out if a failure comes from something they did incorrectly because 
they are not familiar with teuthology-openstack or if it is related to their 
contribution. They will ask for assistance in situations where comparing their 
run with the output of the ceph-qa-suite bot would probably give them enough 
hints to fix the problem themselves.

If the ceph-qa-suite bot becomes unavailable, the contributors are not blocked 
because they can run the tests themselves on their own OpenStack tenant and 
link the results to the pull request in the same way the bot would. Debugging a 
failed test is essentially the same thing as running the ceph-qa-suite bot.

Cheers

[1] run-make-check.sh https://github.com/ceph/ceph/blob/master/run-make-check.sh
[2] Ceph test suites https://github.com/ceph/ceph-qa-suite/tree/master/suites
[3] teuthology-suite --suite . --subset 1/40000
[4] minimal number of tests to run all tasks at least once: 130 for rados, 76 
for fs, 113 for upgrade, 18 for rgw, 45 for rbd.
[5] a former proposal was to include the test suite to run in the commit 
message, but this is more difficult to maintain than a boolean flag that states 
a given commit needs to pass all the relevant tests
[6] teuthology-openstack 
https://github.com/dachary/teuthology/tree/openstack#openstack-backend
[7] Scaling out the Ceph community lab http://dachary.org/?p=3852
-- 
Loïc Dachary, Artisan Logiciel Libre
