Paul Ganssle <> added the comment:


> The problem with random input tests is not that they are 'flakey', but that 
> they are useless unless someone is going to pay attention to failures and try 
> to find the cause.  This touches on the difference between regression testing 
> and bug-finding tests.  CPython CI is the former, and marred at that by buggy 
> randomly failing tests.

> My conclusion: bug testing would likely be a good idea, but should be done 
> separate from the CI test suite.  Such testing should only be done for 
> modules with an active maintainer who would welcome failure reports.

Are you saying that random input tests are flaky but that that is not the big 
problem? In my experience using hypothesis, in practice it is not the case that 
you get tests that fail randomly. The majority of the time, if your code 
violates one of the properties, the tests fail the first time you run the test 
suite (this is particularly true for strategies where hypothesis deliberately 
makes it more likely that you'll get a "nasty" input by biasing the random 
selection algorithm in that direction). In a smaller number of cases, I see 
failures that happen on the second, third or fourth run.

That said, if it were a concern that every run of the tests is using different 
inputs (and thus you might see a bug that only appears once in every 20 runs), 
it is possible to run hypothesis in a configuration where you specify the seed, 
making it so that hypothesis always runs the same set of inputs for the same 
tests. We can then drop the fixed seed in a separate non-CI hypothesis 
"fuzzing" run that executes the test suite for longer (or indefinitely), 
looking for long-tail violations of these properties.
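
As a rough illustration of the fixed-seed idea, here is a minimal sketch that 
uses only the standard library's `random` module, not hypothesis itself. The 
names `random_json_value` and `check_round_trip` are hypothetical, and the 
property checked (a JSON round trip) is just a convenient example. With a 
fixed seed every run sees the same inputs; with `seed=None` each run draws 
fresh ones, which is roughly the "fuzzing" mode described above.

```python
import json
import random


def random_json_value(rng, depth=0):
    """Build a random JSON-compatible value (hypothetical helper, not hypothesis)."""
    scalars = [
        lambda: None,
        lambda: rng.choice([True, False]),
        lambda: rng.randint(-10**6, 10**6),
        # A few deliberately awkward strings, in the spirit of "nasty" inputs.
        lambda: rng.choice(["", "a", "\u00e9", "\n", '"quoted"']),
    ]
    # Stop nesting at depth 3, or about half the time before that.
    if depth >= 3 or rng.random() < 0.5:
        return rng.choice(scalars)()
    if rng.random() < 0.5:
        return [random_json_value(rng, depth + 1) for _ in range(rng.randint(0, 3))]
    return {
        str(rng.randint(0, 9)): random_json_value(rng, depth + 1)
        for _ in range(rng.randint(0, 3))
    }


def check_round_trip(seed=None, examples=200):
    """Check json.loads(json.dumps(o)) == o for `examples` random values.

    A fixed seed gives the same inputs on every run (CI mode);
    seed=None draws fresh inputs each time (fuzzing mode).
    """
    rng = random.Random(seed)
    for _ in range(examples):
        value = random_json_value(rng)
        assert json.loads(json.dumps(value)) == value, value
    return examples


if __name__ == "__main__":
    check_round_trip(seed=1234)   # deterministic: suitable for CI
    check_round_trip(seed=None)   # non-deterministic: separate fuzzing run
```

hypothesis has its own settings for making runs deterministic, so in practice 
none of this would need to be hand-written; the sketch only shows why a fixed 
seed removes run-to-run variation.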

I feel that if we don't at least run some form of the hypothesis tests in CI, 
there will likely be bit rot and the tests will decay in usefulness. Consider 
the case where someone accidentally breaks an edge case that makes it so that 
`json.loads(json.dumps(o))` no longer works for some obscure value of `o`. With 
hypothesis tests running in CI, we are MUCH more likely to find this bug / 
regression during the initial PR that would break the edge case than if we run 
it separately and report it later. If we run the hypothesis tests in a 
buildbot, the process would be:

1. Contributor makes PR with passing CI.
2. Core dev review passes, PR is merged.
3. Buildbot run occurs and the buildbot watch is notified.
4. Buildbot maintainers track down the PR responsible and either file a new bug 
or comment on the old bug.
5. Someone makes a NEW PR adding a regression test and the fix for the old PR.
6. Core dev review passes, second PR is merged.

If we run it in CI, the process would be:

1. Contributor makes PR, CI breaks.
2. If the contributor doesn't notice the broken CI, core dev points it out and 
it is fixed (or the PR is scrapped as unworkable).

Note that in the non-CI process, we need TWO core dev reviews, we need TWO PRs 
(people are not always super motivated to fix bugs that don't affect them but 
that they caused when fixing a bug that does affect them), and we need time and 
effort from the buildbot maintainers (note the same applies even if the 
"buildbot" is actually a separate process run by Zac out of a GitHub repo).

Even if the bug only appears in one out of every 4 CI runs, it's highly likely 
that it will be found and fixed before it makes it into production, or at least 
much more quickly, considering that most PRs go through a few edit cycles, and 
a good fraction of them are backported to 2-3 branches, all with separate CI 
runs. It's a much quicker feedback loop.
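
To put a rough number on that, here is a back-of-the-envelope sketch. It 
assumes each CI run independently has the same chance of hitting the bug, 
which is a simplification, and `detection_probability` is a hypothetical 
helper name.

```python
def detection_probability(per_run, runs):
    """Chance that at least one of `runs` independent runs hits the bug."""
    return 1 - (1 - per_run) ** runs


if __name__ == "__main__":
    # A bug that shows up in 1 out of every 4 runs, over a growing number
    # of CI runs (edit cycles plus backport branches).
    for runs in (1, 3, 5, 8):
        print(runs, round(detection_probability(0.25, runs), 3))
```

Under that assumption, a bug seen in one out of every 4 runs has about a 76% 
chance of showing up at least once across 5 CI runs (say, 3 edit cycles plus 
2 backports), and about 90% across 8.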

I think there's an argument to be made that incorporating more third-party 
libraries (in general) into our CI build might cause headaches, but I think 
that is not a problem specific to hypothesis, and I think it's one where we can 
find a reasonable balance that allows us to use hypothesis in one form or 
another in the standard library.


Python tracker <>
Python-bugs-list mailing list
