Interesting package, Sheng, but I think one problem with that approach is that tests already take an extremely long time to run. Re-running tests that fail could make it even more difficult to merge PRs.
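If we did try it, marking a test is only a couple of lines. Here is a minimal sketch using the flaky package Sheng linked; the max_runs/min_passes thresholds are just illustrative and the test body is a stand-in:

    # Sketch only: retry a known-unstable test up to 3 times and pass if
    # at least one run succeeds (thresholds here are illustrative).
    import numpy as np
    from flaky import flaky

    @flaky(max_runs=3, min_passes=1)
    def test_known_unstable_operator():
        x = np.random.uniform(size=(10,))
        assert x.mean() < 1.0

Even then, every retry multiplies the runtime of that test, so I'd still rather see us fix the flakiness than paper over it.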
On Mon, Oct 16, 2017 at 8:45 PM, Zha, Sheng <[email protected]> wrote:

> Here’s a package that may help us on flaky tests: https://pypi.python.org/pypi/flaky. It can retry tests that are marked flaky and can pass or fail based on a specified threshold. In the short term we can use this to pass tests that are not 100% reliable.
>
> Best regards,
> -sz
>
> On 10/16/17, 10:32 AM, "kellen sunderland" <[email protected]> wrote:
>
> I think you’ve covered the pros/cons of having determinism in your tests. It seems like a potential maintenance effort versus forced code robustness argument to me. I’d suggest you have a separate vote on this topic.
>
> For me the main topic that should be focused on is making the CI system fast and stable for PR builds. I think the best road to doing this, as previously suggested, is to segment tests and move those that are long running, fail occasionally, or require external components into a nightly/weekly test.
>
> I would also propose that:
>
> 6) Test fixtures are created to test a subset of functionality, and that we don’t have test fixtures like test_operator.py that are nearly 5000 lines long and take 20 minutes to run. There are a few advantages to breaking these tests into smaller files:
>
> We will have fewer merge conflicts, because fewer people will be editing the same test files across PRs. Debugging issues with tests will become easier, as hopefully there will be fewer potential side effects between tests (this does happen now). We may be a little more confident that the tests run independently, eventually meaning that we could run them in parallel more easily, which would reduce test run latency (but not throughput). Last, we will be able to disable tests at some convenient level of granularity, for example when running on IoT devices, or without OpenCV. At the moment we’d have to disable tests individually.
>
> 7) We clean up tests that are no longer needed:
>
> I’ve personally found it quite unintuitive in MXNet to discover which tests are actually needed, where they are run, how often, etc. Are the nightly tests actually being run nightly? Are the cpp tests run? Why is the Travis CI folder still there, and what is the difference between the ci_build folder and the Jenkins folder? If we’re going to take a look at revamping the tests folder I’d recommend we clean up the folder structure a bit and delete the non-relevant files to make it easier for newcomers to know what’s happening. We’ll always have these files for reference in source control.
>
> -Kellen
>
> From: Chris Olivier
> Sent: Monday, October 16, 2017 6:46 PM
> To: [email protected]
> Subject: Re: Improving and rationalizing unit tests
>
> My argument is that I am actually categorically against having a requirement that the same input values be used for testing for every run.
>
> I don't personally view "convenience in reproducing" as outweighing "finding edge cases that I didn't think of or that haven't been tried before".
>
> On Mon, Oct 16, 2017 at 9:34 AM, Pedro Larroy <[email protected]> wrote:
>
> > It's always going to be deterministic one way or another unless you use random from the entropy pool such as /dev/random. I don't think it's good practice not to seed properly and have values depend on execution order / parallelism / time or whatever, but that's just my opinion. I would want to use the same values for all test runs for reproducibility.
> >
> > I think your argument goes more towards the previously mentioned "property based testing" approach, which is in the spirit of what you are supporting, if I'm not mistaken.
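To make the seeding point above concrete: one way to seed properly without pinning the same values forever is to take the seed from an environment variable when it is set, otherwise draw a fresh one, and always print it so a failing run can be reproduced exactly. This is only a sketch; MXNET_TEST_SEED is a hypothetical variable name used here for illustration:

    # Sketch: seed numpy/mxnet RNGs from MXNET_TEST_SEED when set, otherwise
    # draw a fresh seed, and print it so any failure can be reproduced.
    # (MXNET_TEST_SEED is a hypothetical name, not an existing setting.)
    import os
    import random

    import mxnet as mx
    import numpy as np

    def setup_module():
        seed = int(os.getenv('MXNET_TEST_SEED',
                             random.SystemRandom().randint(0, 2**31 - 1)))
        print('Seeding tests with MXNET_TEST_SEED=%d' % seed)
        np.random.seed(seed)
        mx.random.seed(seed)

That would keep varied inputs on most runs while still giving exact reproducibility when chasing a failure.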
> > On Mon, Oct 16, 2017 at 6:22 PM, Chris Olivier <[email protected]> wrote:
> >
> > > My take on the suggestion of purely deterministic inputs (including deterministic seeding) is:
> > >
> > > "I want the same values to be used for all test runs because it is inconvenient when a unit test fails for some edge cases. I prefer that unforeseen edge case failures occur in the field and not during testing."
> > >
> > > Is this the motivation? Seems strange to me.
> > >
> > > On Mon, Oct 16, 2017 at 9:09 AM, Pedro Larroy <[email protected]> wrote:
> > >
> > > > I think using a properly seeded and initialized (pseudo)random is actually beneficial (and deterministic); handpicked examples are usually too simplistic and miss corner cases.
> > > >
> > > > Better yet is to use property based testing, which will pick corner cases and do fuzzing automatically to check with a high degree of confidence that a testing condition holds.
> > > >
> > > > Probably it would be good if we use a property based testing library in addition to nose to check invariants.
> > > >
> > > > A quick googling yields this one for Python, for example: https://hypothesis.readthedocs.io/en/latest/quickstart.html. Does anyone have experience with, or can recommend, a nice property based testing library for Python?
> > > >
> > > > Regards
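On Pedro's question above: I haven't used it in anger, but with the hypothesis library he links a property check would look roughly like this. Sketch only; the softmax property being tested is just an example:

    # Sketch of property-based testing with hypothesis: the framework
    # generates inputs (including corner cases) and shrinks any failing
    # example down to a minimal reproduction.
    import numpy as np
    from hypothesis import given
    from hypothesis import strategies as st

    @given(st.lists(st.floats(min_value=-1e6, max_value=1e6,
                              allow_nan=False, allow_infinity=False),
                    min_size=1))
    def test_softmax_sums_to_one(values):
        x = np.array(values)
        e = np.exp(x - x.max())  # subtract the max for numerical stability
        softmax = e / e.sum()
        assert abs(softmax.sum() - 1.0) < 1e-5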
> > > > On Mon, Oct 16, 2017 at 4:56 PM, Bhavin Thaker <[email protected]> wrote:
> > > >
> > > > > I agree with Pedro.
> > > > >
> > > > > Based on various observations on unit test failures, I would like to propose a few guidelines to follow for the unit tests. Even though I use the word “must” for my humble opinions below, please feel free to suggest alternatives or modifications to these guidelines:
> > > > >
> > > > > 1a) Each unit test must have a run time budget <= X minutes. Say, X = 2 minutes max.
> > > > > 1b) The total run time budget for all unit tests <= Y minutes. Say, Y = 60 minutes max.
> > > > >
> > > > > 2) All unit tests must have deterministic (not stochastic) behavior. That is, instead of using the random() function to test a range of input values, each input test value must be carefully hand-picked to represent the commonly used input scenarios. The correct place to stochastically test random input values is continuously running nightly tests, NOT the sanity/smoke/unit tests for each PR.
> > > > >
> > > > > 3) All unit tests must be as self-contained and independent of external components as possible. For example, datasets required for a unit test must NOT be hosted on an external website which, if unreachable, can cause test run failures. Instead, all datasets must be available locally.
> > > > >
> > > > > 4) It is impossible to test everything in unit tests, so only common use-cases and code-paths must be tested in unit tests. Less common scenarios like integration with 3rd party products must be tested in nightly/weekly tests.
> > > > >
> > > > > 5) A unit test must NOT be disabled on a failure unless proven to exhibit unreliable behavior. The burden of proof for a test failure must be on the PR submitter, and the PR must NOT be merged without opening a new GitHub issue explaining the problem. If the unit test is disabled for some reason, then the unit test must NOT be removed from the unit tests list; instead the unit test must be modified to add the following lines at the start of the test:
> > > > > Print(“Unit Test DISABLED; see GitHub issue: NNNN”)
> > > > > Exit(0)
> > > > >
> > > > > Please suggest modifications to the above proposal such that we can make the unit test framework the rock-solid foundation for the active development of Apache MXNet (Incubating).
> > > > >
> > > > > Regards,
> > > > > Bhavin Thaker.
> > > > >
> > > > > On Mon, Oct 16, 2017 at 5:56 AM Pedro Larroy <[email protected]> wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Some of the unit tests are extremely costly in terms of memory and compute.
> > > > > >
> > > > > > As an example, in the gluon tests we are loading all the datasets: test_gluon_data.test_datasets.
> > > > > >
> > > > > > We are also running huge networks like resnets in test_gluon_model_zoo.
> > > > > >
> > > > > > This is ridiculously slow, and outright impossible on some embedded / memory constrained devices, and in any case it makes tests run for longer than needed.
> > > > > >
> > > > > > Unit tests should be small and self-contained, and if possible pure (avoiding this kind of dataset IO).
> > > > > >
> > > > > > I think it would be better to split them into real unit tests and extended integration test suites that do more intensive computation. This would also help with feedback time on PRs and the CI infrastructure.
> > > > > >
> > > > > > Thoughts?
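One last thought on Bhavin's point 5 above: rather than a print-and-exit at the top of the test, the usual way to disable a test without deleting it is a skip that carries the tracking issue, so it still shows up in the report as skipped. A minimal sketch (the issue number is a placeholder, as in Bhavin's example):

    # Sketch: disable a test while keeping it visible in the suite and
    # pointing at the tracking issue. (Issue number is a placeholder.)
    import unittest

    @unittest.skip('Disabled due to flakiness; see GitHub issue #NNNN')
    def test_known_flaky_operator():
        pass

Skipped tests show up in the test report, so they stay visible until the issue is closed.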
