On Sat, Dec 5, 2015 at 11:49 AM, Loic Dachary <l...@dachary.org> wrote:
> Hi Ceph,
>
> TL;DR: a ceph-qa-suite bot running on pull requests is sustainable and is an
> incentive for contributors to use teuthology-openstack independently
A bot for scheduling a named suite on a named PR and posting the results back
to the PR is definitely a good thing.

Thinking further about using commit messages to toggle the testing, I think
that this could get awkward when it's coupled to the human side of code
review. When someone pushes a "how about this?" modification, they don't
necessarily want to re-run the test suite until the reviewer has okayed it.
But that means they have to push again, and the final thing that's tested
would be a different SHA1 (hopefully the same code) than the one the human
last reviewed. We'll also have e.g. rebases, where there tends to be some
discretion about whether a rebase requires a re-test.

When you were talking about having the suite selected in the qa: tag, the
motivation for putting it in the commit message was that it would be
preserved in backports. However, if the "Needs-qa:" flag is just a boolean,
then I think it makes more sense to control it with a GitHub label or by
posting a command in a PR comment.

I'm not sure how this really helps with the resource issues; for example,
with the fs suite we would probably not be able to make a finer-grained
choice about which tests to run based on the diff.

The part about randomly dropping a subset of tests when resources are low
doesn't make sense to me; I think the bot should either give up or enqueue
itself.

Cheers,
John

> When a pull request is submitted, it is compiled, some tests are run [1] and
> the result is added to the pull request to confirm that it does not introduce
> a trivial problem. Such tests are however limited because they must:
>
> * run within a few minutes at most
> * not require multiple machines
> * not require root privileges
>
> More extensive tests (primarily integration tests) are needed before a
> contribution can be merged into Ceph [2], to verify it does not introduce a
> subtle regression.
> It would be ideal to run these integration tests on each
> pull request, but there are two obstacles:
>
> * each test takes ~1.5 hours
> * each test costs ~0.30 euros
>
> On the current master, running all tests would require ~1000 jobs [3]. That
> would cost ~300 euros on each pull request and take ~10 hours assuming 100
> jobs can run in parallel. We could resolve that problem by:
>
> * maintaining a ceph-qa-suite map to be used as a whitelist mapping a diff
>   to a set of tests. For instance, if the diff modifies the src/ceph-disk
>   file, it outputs the ceph-disk suite [4]. This would effectively trim the
>   tests that are unrelated to the contribution and reduce the number of tests
>   to a maximum of ~100 [4] and most likely about a dozen.
> * running tests only if one of the commits of the pull request has the
>   *Needs-qa: true* flag in the commit message [5]
> * limiting the number of tests to fit in the allocated budget. If there was
>   enough funding for 10,000 jobs during the previous period and there were a
>   total of 1,000 test runs required (a test run is a set of tests as produced
>   by the ceph-qa-suite map), each run is trimmed to a maximum of ten tests,
>   regardless.
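A minimal Python sketch of the three mechanisms in the list above (whitelist map, Needs-qa flag, budget cap). Everything in it is hypothetical: the proposal does not specify the map's format, so QA_SUITE_MAP with glob patterns is an assumption; only its src/ceph-disk entry comes from the text, and the librados patterns are guesses that mirror the worked example. The proposal also does not say which tests survive trimming; this sketch keeps the first ones in sorted order so that a re-run selects the same subset.

```python
import fnmatch
import re

# Hypothetical whitelist: glob pattern on a changed path -> tests to schedule.
# Only the src/ceph-disk -> ceph-disk entry comes from the proposal; the
# librados patterns are illustrative guesses based on the worked example.
QA_SUITE_MAP = {
    "src/ceph-disk": ["ceph-disk"],
    "src/librados/*": ["smoke/basic/tasks/rados_api_tests.yaml"],
    "src/include/rados/*": ["smoke/basic/tasks/rados_api_tests.yaml"],
}

# Matches a "Needs-qa: true" line anywhere in a commit message.
NEEDS_QA = re.compile(r"^Needs-qa:[ \t]*true[ \t]*$", re.IGNORECASE | re.MULTILINE)


def needs_qa(commit_messages):
    """True if any commit of the pull request carries the Needs-qa: true flag."""
    return any(NEEDS_QA.search(message) for message in commit_messages)


def map_diff_to_tests(changed_paths, qa_map=QA_SUITE_MAP):
    """Return the sorted, de-duplicated tests the whitelist selects for a diff."""
    selected = set()
    for path in changed_paths:
        for pattern, tests in qa_map.items():
            # fnmatchcase is case-sensitive on every platform; '*' also
            # matches '/', so "src/librados/*" covers subdirectories too.
            if fnmatch.fnmatchcase(path, pattern):
                selected.update(tests)
    return sorted(selected)


def per_run_cap(budget_jobs, runs_required):
    """Spread last period's job budget evenly over the runs that were required."""
    return budget_jobs // max(runs_required, 1)


def plan_run(commit_messages, changed_paths, budget_jobs, runs_required):
    """Tests the bot would schedule for one pull request (empty list: no run)."""
    if not needs_qa(commit_messages):
        return []
    tests = map_diff_to_tests(changed_paths)
    # Deterministic trimming: keep the first tests in sorted order.
    return tests[:per_run_cap(budget_jobs, runs_required)]


if __name__ == "__main__":
    messages = ["librados: fix a bug\n\nNeeds-qa: true\nSigned-off-by: Joe"]
    print(plan_run(messages, ["src/librados/librados.cc"], 10000, 1000))
```

With the figures from the proposal, 10,000 budgeted jobs spread over 1,000 required runs gives a cap of 10 tests per run.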
>
> Here is an example:
>
> * Joe submits a pull request to fix a bug in the librados API
> * The make check bot compiles it and make check fails because the change
>   introduces a bug
> * Joe uses run-make-check.sh locally to reproduce the failure, fixes it and
>   repushes
> * The make check bot compiles it and make check passes
> * Joe amends the commit message to add *Needs-qa: true* and repushes
> * The ceph-qa-suite map script finds a change in the librados API and outputs
>   smoke/basic/tasks/rados_api_tests.yaml
> * The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
>   which fails
> * Joe examines the logs found at http://teuthology-logs.public.ceph.com/ and
>   decides to debug by running the test himself
> * Joe runs teuthology-openstack --suite smoke/basic/tasks/rados_api_tests.yaml
>   against his own OpenStack tenant [6]
> * Joe repushes with a fix
> * The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
>   which succeeds
> * Kefu reviews the pull request and finds a link to the successful test runs
>   in the comments
>
> This approach scales with the size of the Ceph developer community [7]
> because regular contributors benefit directly from funding the ceph-qa-suite
> bot. New contributors can focus on learning how to interpret the
> ceph-qa-suite error logs for their contribution and, if needed, how to debug
> it via teuthology-openstack. That is a better user experience than trying to
> figure out which ceph-qa-suite job to run, learning about teuthology,
> scheduling the test and interpreting the results.
>
> The maintenance workload of a ceph-qa-suite bot is probably about one work
> day a week: handling funding and administering the server where the bot
> runs, but mostly sorting out the false negatives. I believe a pure
> self-service approach, where each contributor would be asked to run
> teuthology-openstack independently, would actually require more work.
> The ceph-qa-suite bot provides a baseline that everybody can agree on when
> sorting out the false negatives. When a contributor runs teuthology-openstack
> by herself/himself, it is difficult for her/him to figure out whether a
> failure comes from something she/he did incorrectly because she/he is not
> familiar with teuthology-openstack, or whether it is related to her/his
> contribution. She/He will ask for assistance in situations where comparing
> her/his run with the output of the ceph-qa-suite bot would probably give
> her/him enough hints to fix the problem herself/himself.
>
> If the ceph-qa-suite bot becomes unavailable, the contributors are not
> blocked, because they can run the same tests themselves on their own
> OpenStack tenant and link the results to the pull request in the same way the
> bot would. Debugging a failed test is essentially the same thing as running
> the ceph-qa-suite bot.
>
> Cheers
>
> [1] run-make-check.sh
> https://github.com/ceph/ceph/blob/master/run-make-check.sh
> [2] Ceph test suites
> https://github.com/ceph/ceph-qa-suite/tree/master/suites
> [3] teuthology-suite --suite . --subset 1/40000
> [4] minimal number of tests to run all tasks at least once: 130 for rados, 76
> for fs, 113 for upgrade, 18 for rgw, 45 for rbd.
> [5] a former proposal was to include the test suite to run in the commit
> message, but this is more difficult to maintain than a boolean flag that
> states a given commit needs to pass all the relevant tests.
> [6] teuthology-openstack
> https://github.com/dachary/teuthology/tree/openstack#openstack-backend
> [7] Scaling out the Ceph community lab
> http://dachary.org/?p=3852
> --
> Loïc Dachary, Artisan Logiciel Libre