On 11/19/2012 11:33 PM, Eric Rescorla wrote:
> TL;DR: I think this is almost 180 degrees opposed from the right strategy. What we need right now is coverage of the things that work, so that when we make changes we know we're not making the situation worse. *All* other testing work should wait on this. In particular, crashtests are not useful at this point.
I'll note that "crashtests" is a term being used loosely here. A crashtest is nothing more than "when I load this, nothing crashes", while a mochitest is (effectively) "when I load this, it reports ok or not-ok". So really, there isn't a lot of difference there beyond (for most WebRTC things) "we got to the normal end of the test".
The way some people are using "crashtests" is really "we once crashed here, capture that as a crashtest", with the assumption you're protecting against that (exact) regression.
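To make the shapes concrete, here's a minimal sketch of each (the file contents are hypothetical and heavily simplified; a crashtest is just a page the harness loads and watches for a crash, while a mochitest pulls in SimpleTest.js and reports real pass/fail results):

  <!-- crashtest sketch: the harness loads this and passes if nothing crashes -->
  <html>
    <script>
      // ... poke the code path that once crashed ...
    </script>
  </html>

  <!-- mochitest sketch: same idea, but it asserts and reports pass/fail -->
  <html>
    <script src="/tests/SimpleTest/SimpleTest.js"></script>
    <script>
      // ... poke the same code path, then check the result ...
      var resultLooksRight = true;  // placeholder for a real check
      ok(resultLooksRight, "the API did what we expected, not just 'didn't crash'");
    </script>
  </html>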
In active development of a new feature, tests for real-world(ish) cases are most important. As ekr points out, the first order of business is "don't break existing functionality". Testing for that on each checkin is far more important than checking whether we've regressed a minimized testcase from a specific bug.
Once we have the main functionality tests in place (both "sunny day" and "normal failure" cases), more-specific regression tests begin to rise in importance, though even those should usually be as broad a test as possible that still captures the failure, rather than a fully minimized testcase. The reason is that for a test to have value in the future, it should not just catch someone undoing the bugfix; it should also catch similarly-provoked failures as much as possible, since most future errors will not precisely duplicate an old bug, but will arise from similar stress on the component.
An example might be a bug triggered by having >10 DataChannels open. The minimized crashtest testcase might stop there, and it *would* prevent that specific bug from regressing. But if some new checkin caused >16 DataChannels to fail, that crashtest wouldn't catch it. A good (IMO) test would instead create an unreasonable number of DataChannels, do something basic with each of them to verify they all work, and then return success. Note that this ends up being a mochitest, not a crashtest! So don't think "this was a crash bug, it needs a crashtest". I would use crashtests only where there's no reasonable way to report success or failure. For example, if we changed an API, a crashtest would happily stay green, while the mochitest version would flag that we broke the test.
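As a rough sketch of what that broader mochitest might look like (using the modern, unprefixed WebRTC API and modern JS for readability; the channel count, names, and the bare-bones signaling are all illustrative, and a real in-tree test would lean on shared PeerConnection helpers):

  <html>
    <script src="/tests/SimpleTest/SimpleTest.js"></script>
    <script>
      SimpleTest.waitForExplicitFinish();

      // Deliberately "unreasonable": well past the >10 that triggered the original bug.
      const NUM_CHANNELS = 64;

      const pcLocal = new RTCPeerConnection();
      const pcRemote = new RTCPeerConnection();

      // Trickle ICE candidates directly between the two PeerConnections.
      pcLocal.onicecandidate  = e => e.candidate && pcRemote.addIceCandidate(e.candidate);
      pcRemote.onicecandidate = e => e.candidate && pcLocal.addIceCandidate(e.candidate);

      // The remote side just echoes whatever arrives on each channel.
      pcRemote.ondatachannel = e => {
        e.channel.onmessage = m => e.channel.send(m.data);
      };

      // Create all the channels, do something basic with each, and only report
      // success once every one of them has round-tripped a message.
      let echoed = 0;
      for (let i = 0; i < NUM_CHANNELS; i++) {
        const dc = pcLocal.createDataChannel("chan-" + i);
        dc.onopen = () => dc.send("ping-" + i);
        dc.onmessage = m => {
          is(m.data, "ping-" + i, "channel " + i + " echoed its message");
          if (++echoed == NUM_CHANNELS) {
            ok(true, "all " + NUM_CHANNELS + " DataChannels opened and worked");
            SimpleTest.finish();
          }
        };
      }

      // Minimal offer/answer dance; a real test would use the shared helpers.
      pcLocal.createOffer()
        .then(offer => pcLocal.setLocalDescription(offer))
        .then(() => pcRemote.setRemoteDescription(pcLocal.localDescription))
        .then(() => pcRemote.createAnswer())
        .then(answer => pcRemote.setLocalDescription(answer))
        .then(() => pcLocal.setRemoteDescription(pcRemote.localDescription));
    </script>
  </html>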
This does mean almost all tests should be mochitests. That's not a bad thing; it's a good thing! But the bigger takeaway is that tests should start with functionality (the feature is useless without that), then add general stressors (create/destroy fast, create a lot, evil arrangements, obvious edge cases, etc.), then (especially as the code matures and stabilizes) add very specific tests. If you focus on the very specific tests too early, you end up with a ton of them, many of which are no longer relevant after architectural changes and refactors (and you may spend a lot of time revising them while an API is still in flux!). This isn't an absolute, and it doesn't mean a crash bug never gets morphed into a (mochi)test - many still will - but again, those should be broader tests wherever possible, or the case should be worked into an existing broad test.
That's another point: a few broad tests are a lot simpler to revise when we have an unstable API than a thousand specific-instance testcases. Given that the W3C (and IETF) are still changing the API, we should be cautious about this. If a large, generic sunny-day test that hits much of the API fails, that's generally of (much) more use to me today than the same set of features being tested in 100 small separate tests, if for no other reason than the maintenance effect (and there are other reasons, per above).
> Analysis follows:
>
> Currently, the state of the code can be best described as brittle. The sunny day cases work and work fairly reliably, but as you deviate from them, you start to get crashes. This isn't surprising; it's a fairly normal state of affairs for software that was developed in a hurry with an emphasis on features rather than testing, which is what this is.
>
> There are three major classes of problems in the code:
>
> - Functional defects (stuff which should work and does not).
> - Crashes (i.e., memory errors, whether security relevant or not).
> - Memory and other resource leaks.
>
> There are a lot of all of these, and a number of them require fairly significant refactoring to fix. The result of this is that:
>
> 1. It's not going to be possible to *run* automated unit tests for some time (this is partly due to deficiencies in the testing framework, but you go to the release with the testing framework you have, not the one you wish you had.)
>
> 2. It's quite common to have multiple reports of problems due to the same underlying defect. Even when that's not true, we're often going to accidentally fix one defect while working on another.
Not just accidentally, but often on purpose, as we notice a logical hole that can cause (or has caused) a number of similar failures. A perfect example is the current "big lock" patch, which addresses a hole responsible for numerous semi-random, timing-related failures.
> 3. It's quite easy to break one piece of functionality while refactoring another. We've already had a number of instances of this.
Yes; that's my number one concern for tests at this point. We're checking in code with the only true cross-check being ourselves and other devs running manual tests and using the feature. While doing that is good, tests (automated or run-by-hand-but-scripted) would greatly reduce the "you broke calls" problem.
> Rather, what is useful is to develop unit tests for the stuff that does work, so that when we are fixing one thing we don't accidentally break another. I realize that these cannot currently be run automatically on m-c, but they can be run either (a) on a developer's own system or (b) on alder under the more resilient test framework Ted is hacking up. It already takes far too long to validate manually that your changes haven't broken stuff, and we don't have anything like full coverage in JS tests of the functionality that works. If we want people not to break the system, it needs to be reasonably easy to verify that one has not done so.
Agreed.
> The nice thing about this level of test is that it's easy to develop. Indeed, we already have a number of tests that test one variant manually. All we need is someone to stamp out a bunch of tests that (a) are automatic and (b) cover more of the space. If you find bugs doing that, then great. Otherwise, we have a regression harness and you can move on to crashtests.
Agreed, and honestly a basic set of such functionality tests should be fairly quick to develop and would provide significant help to the project early on. They don't (in my mind) have to be a lot of small, specific functionality tests. For example, I'd rather have one test that covers all the basic call scenarios (audio-only, video-only, audio/video, one-way audio, one-way video, etc.), because a single broad test is likely to be easier to revise if the API changes, and if it fails, we can find out why quickly. I'm not saying it has to be that way, but I might prefer it, especially if I can build a "basic call test framework" I can then internally automate. See things like the test_URIs xpcshell test for the URI code, where it has a framework and then runs a bazillion test/edge cases through it, all in one "test".
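In the spirit of test_URIs, a rough sketch of the shape I mean (everything here is hypothetical - runBasicCall() and the scenario table are not an existing API, and the real plumbing would live in shared helpers inside a mochitest page that has loaded SimpleTest.js):

  // One driver routine plus a table of scenarios, instead of one file per call variant.
  const CALL_SCENARIOS = [
    { name: "audio only",    audio: true,  video: false },
    { name: "video only",    audio: false, video: true  },
    { name: "audio + video", audio: true,  video: true  },
    { name: "one-way audio", audio: true,  video: false, oneWay: true },
    { name: "one-way video", audio: false, video: true,  oneWay: true },
  ];

  async function runBasicCall(scenario) {
    // Hypothetical shared plumbing: set up two PeerConnections, attach the
    // requested media, negotiate, and check that media flows in the expected
    // direction(s).
    // ...
    ok(true, "call completed: " + scenario.name);  // placeholder for real checks
  }

  async function runAllScenarios() {
    SimpleTest.waitForExplicitFinish();
    for (const scenario of CALL_SCENARIOS) {
      info("starting scenario: " + scenario.name);
      await runBasicCall(scenario);
    }
    SimpleTest.finish();
  }

  runAllScenarios();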
--
Randell Jesup

