Zac Hatfield-Dodds <[email protected]> added the comment:
(I think structuring this as a high-level explanation is clearer than
point-by-point replies - it covers most of them, and the logic is hopefully
easier to follow.)
The core idea of Hypothesis is that making random choices is equivalent to
parsing a random bitstring:
- wrap your test function in a decorator which calls it many times with
random inputs
- describe those inputs with strategies (which are parser combinators)
- the bytes parsed by strategies can be provided by a Random() instance,
saved to and loaded from a database, edited and (maybe) re-parsed into related
or simpler inputs, etc.
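For concreteness, a minimal test in that style looks like this (the property
is one I've made up for illustration):

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sorting_is_idempotent(xs):
        # Hypothesis calls this many times; each list of integers is
        # parsed out of a random (or replayed) bytestring.
        assert sorted(sorted(xs)) == sorted(xs)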
Minithesis [1] is a complete implementation of this core, including strategies,
shrinking, and the database, in a few hundred lines of Python. I think reading
this will answer most of your implementation questions.
Hypothesis is not the *only* property-based testing library for Python (there's
pytest-quickcheck), but only in the same sense that `six` is not the only py2/py3
compatibility layer. I attribute this to a combination of a better underlying
model [2], and brute implementation effort to provide great error messages,
integration with popular libraries, lots of strategies, related tooling, and so
on.
[1] https://github.com/DRMacIver/minithesis
See also https://hypothesis.works/articles/how-hypothesis-works/ for an
older writeup; the implementation details are a little outdated but it explains
the motivations better.
[2] https://hypothesis.works/articles/types-and-properties/
So a "strategy" is an object (implementation detail: a parser combinator,
instance of SearchStrategy), which tells Hypothesis how to convert a random
bytestring into whatever object you've asked for. This is a little unusual,
but I haven't found a better way which still makes it easy to mix the strategy
primitives with arbitrary user code such as "pick a username, ensure it's in
the database, and then pass it to my test method".
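Here's a sketch of that username case using st.composite, with an in-memory
set as a hypothetical stand-in for the database:

    from hypothesis import given, strategies as st

    USER_DB = set()  # hypothetical stand-in for a real database

    @st.composite
    def usernames(draw):
        # strategy primitives and arbitrary user code, freely mixed
        name = draw(st.text(min_size=1))
        USER_DB.add(name)  # "ensure it's in the database"
        return name

    @given(usernames())
    def test_lookup_finds_existing_user(name):
        assert name in USER_DB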
I share your distaste for "Haskell in Python"; I don't think Hypothesis
introduces any more of that flavor than Numpy introduces Fortran idioms -
there's occasionally a resemblance, but they're both Python-first libraries and
I'd attribute it mostly to the problem being solved. FWIW Hypothesis
deliberately rejected most of the design choices of QuickCheck, and
implementation-wise has more in common with unixy fuzzing tools like AFL.
The strategies in our API can generally be divided into useful primitives (e.g.
integers, floats, lists, tuples, sampled_from, one_of) and pre-composed
strategies (booleans = sampled_from, dictionaries = map+lists-of-kv-tuples,
builds = tuples+fixed_dictionaries+map).
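To illustrate, booleans() and dictionaries() behave roughly like the
compositions below (the real implementations have more options and tuning):

    import hypothesis.strategies as st

    def booleans():
        return st.sampled_from([False, True])

    def dictionaries(keys, values):
        # lists of key/value tuples, deduplicated by key, mapped to dict
        return st.lists(
            st.tuples(keys, values), unique_by=lambda kv: kv[0]
        ).map(dict)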
Very specific strategies like emails() are included because they're a common
need, and a naive implementation usually omits exactly the weirder and more
error-prone features that might reveal bugs.
We _are_ deliberately vague about the order in which we generate test data,
because the distribution is full of heuristics and adapts itself to the
behaviour of the test - including over multiple runs, thanks to the database.
The exact details are also considered an implementation detail, and subject to
change as we discover heuristics and sampling techniques which find bugs faster.
The upside is that writing effective strategies is straightforward:
- expressing more possible (valid) inputs is good, because if the bug only
triggers on NaN we'd have to generate some NaNs to detect it; and
- limiting use of rejection sampling (hypothesis.assume(), the .filter()
method) is an obvious performance improvement - if only 20% of the inputs we
try to generate are valid, the test will take five times longer (see the
sketch below).
Hypothesis' backend (the "conjecture" engine) then uses a large helping of
heuristics and fancy code to ensure that we get a nice diverse sample from the
input space.  Unfortunately there's a direct tension between high performance
and explainable behaviour; Minithesis is readable mostly by virtue of not
handling the hard edge cases.
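As a sketch of that rejection-sampling point, with two made-up strategies for
even integers:

    import hypothesis.strategies as st

    # Rejection sampling: draw an integer, discard about half of the draws.
    evens_by_filter = st.integers().filter(lambda n: n % 2 == 0)

    # Constructive: every draw is already valid, so no work is wasted.
    evens_by_map = st.integers().map(lambda n: n * 2)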
I'll take my Hypothesmith [3] project for random Python programs as an example
of a complex strategy. It was inspired by CSmith [4] - but is considerably
less sophisticated.  CSmith constructs semantically valid C programs;
Hypothesmith (for now!) constructs a subset of syntactically valid Python.  The
basic idea is to "invert" the grammar: start with the root node, recursively
construct a valid parse tree, and then serialise it out to a parsable input
string. Doing a better job would require more time than I had for a
proof-of-concept, but no new insights; and the same applies to defining a
strategy for semantically-valid code.  Being based on the grammar, I'd guess
it's unlikely to find corner cases in the grammar, but it might do so in the
parser, or via e.g. `(tree := ast.parse(code)) == ast.parse(ast.unparse(tree))`.
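The same trick in miniature, for a toy arithmetic grammar rather than
Python's (`expr` and the test are names I've made up; ast.unparse needs
Python 3.9+):

    import ast
    from hypothesis import given, strategies as st

    # Invert the grammar: build a parse tree top-down, then serialise it.
    expr = st.recursive(
        st.integers(0, 9).map(str),  # leaf node: a single digit literal
        lambda inner: st.tuples(inner, st.sampled_from("+-*"), inner).map("".join),
    )

    @given(expr)
    def test_roundtrips_through_the_parser(code):
        tree = ast.parse(code, mode="eval")
        assert ast.dump(tree) == ast.dump(ast.parse(ast.unparse(tree), mode="eval"))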
Your distrust of the python_programs() strategy is entirely reasonable, and I'd
also want to understand it. I'd just add that it doesn't need to be elegant,
or general, or express all inputs... so long as it can find valid inputs that
reveal a bug, more easily or cheaply than users or other testing techniques :-)
To test compiler optimizations: yes, once you had the strategy, that exact test
body would work.  An even more effective option is to use "equivalence modulo
inputs" [5] to derive `func2` and also assert that `optimizer(func)(arg) ==
optimizer(func2)(arg)`.
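The shape of that test, with python_programs(), make_emi_variant(), and
optimizer() all hypothetical stand-ins - make_emi_variant would delete code
that `func` never executes on this particular input, so both versions must
agree on it:

    from hypothesis import given, strategies as st

    @given(python_programs(), st.integers())  # python_programs() is hypothetical
    def test_optimizer_emi(func, arg):
        func2 = make_emi_variant(func, arg)  # hypothetical EMI transform
        assert optimizer(func)(arg) == optimizer(func2)(arg)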
[3] https://pypi.org/project/hypothesmith/ ; in action at
https://github.com/psf/black/blob/main/fuzz.py
[4] https://embed.cs.utah.edu/csmith/
[5] http://vuminhle.com/pdf/pldi14-emi.pdf
I hope that helps; in particular Minithesis should take most of the magic out
of it.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue42109>
_______________________________________