[Numpy-discussion] Re: Fuzzing integration of Numpy into OSS-Fuzz

david korczynski Thu, 09 Jun 2022 13:41:07 -0700

Coverage-guided fuzzing is fundamentally just a technique that iteratively
generates inputs that explore more of the possible execution space of the
targeted code. What the fuzzer gives you to play with is a byte array that
you can massage in any way you like before passing it into the code under
analysis. The fuzz engine then observes whether the code under analysis
executed in a way that was not seen before and, if so, saves the given byte
array.
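
To make that loop concrete, here is a toy sketch in pure Python. It is not how
libFuzzer or Atheris are implemented: the target function, the mutation
strategy and the use of sys.settrace for line coverage are all assumptions
picked to keep the example small and free of dependencies.

```python
import random
import sys


def target(data: bytes) -> int:
    """Hypothetical code under test: each nested branch needs one more byte."""
    depth = 0
    if data[:1] == b"n":
        depth = 1
        if data[1:2] == b"p":
            depth = 2
            if data[2:3] == b"y":
                depth = 3
    return depth


def run_with_coverage(data: bytes) -> frozenset:
    """Run target(data) and return the set of line numbers it executed."""
    lines = set()

    def tracer(frame, event, arg):
        if frame.f_code is not target.__code__:
            return None  # ignore frames that are not the target
        if event == "line":
            lines.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        target(data)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return frozenset(lines)


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one random byte, or append a random byte."""
    buf = bytearray(data)
    if buf and rng.random() < 0.8:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)


def fuzz(iterations: int, seed: int = 0) -> list:
    """Keep every input that executes at least one line not seen before."""
    rng = random.Random(seed)
    corpus = [b""]
    seen_lines = set(run_with_coverage(b""))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = run_with_coverage(candidate)
        if coverage - seen_lines:  # new code reached: save the byte array
            seen_lines |= coverage
            corpus.append(candidate)
    return corpus


if __name__ == "__main__":
    for entry in fuzz(20000):
        print(entry, target(entry))
```

Each saved input becomes a starting point for later mutations, which is how
the loop gets past the nested checks without being told what they look for.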

Using this you can test for many things. The Hypothesis workflow you
describe, testing a given input and checking whether some post-condition is
satisfied, can be done with fuzzing too: convert the byte array from the
fuzzer into higher-level data structures, pass these data structures into the
target code, and then use the same asserts to check that all post-conditions
are satisfied.
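
A minimal sketch of that conversion step, under stated assumptions: the
helper names are made up for illustration, and Python's built-in sorted()
stands in for the code under test so the example runs without extra
dependencies. fuzz_one_input takes raw bytes, which is the shape of callback
a fuzz engine such as Atheris calls.

```python
def bytes_to_ints(data: bytes) -> list:
    """Decode fuzzer bytes into a list of signed 16-bit integers."""
    usable = len(data) - (len(data) % 2)  # drop a trailing odd byte
    return [
        int.from_bytes(data[i:i + 2], "little", signed=True)
        for i in range(0, usable, 2)
    ]


def fuzz_one_input(data: bytes) -> None:
    """Bytes in, asserts inside: the same checks a property test would make."""
    values = bytes_to_ints(data)
    result = sorted(values)  # stand-in for the real target code

    # Post-conditions: same length, same elements, non-decreasing order.
    assert len(result) == len(values)
    assert sorted(result) == result
    assert all(a <= b for a, b in zip(result, result[1:]))
    for value in set(values):
        assert result.count(value) == values.count(value)


if __name__ == "__main__":
    fuzz_one_input(b"\x01\x00\xff\xff\x05")
```

A failed assert surfaces as an uncaught AssertionError, which the fuzz engine
treats like any other crash and reports together with the offending bytes.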

In the context of NumPy, we can test for:
1) Memory corruption issues in the native code (OSS-Fuzz will compile it with
sanitizers).
2) Unexpected exceptions, i.e. call functions in NumPy with data that is
seeded with fuzz input and ensure no exceptions are raised besides those
documented.
3) Behavioural testing similar to how you describe using Hypothesis.
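
For illustration, option (2) could be sketched as below. This is not the
fuzzer from the PR: make_harness is a hypothetical helper, and
zlib.decompress with zlib.error as its only allowed exception is a stand-in
target chosen so the example needs nothing beyond the standard library.

```python
import zlib


def make_harness(target, documented):
    """Wrap target so that only undocumented exceptions escape."""

    def fuzz_one_input(data: bytes) -> None:
        try:
            target(data)
        except documented:
            pass  # documented rejection of malformed input, not a bug

    return fuzz_one_input


# Assumption for this sketch: zlib.error is the only exception we accept
# from zlib.decompress when it is handed arbitrary bytes.
harness = make_harness(zlib.decompress, (zlib.error,))

if __name__ == "__main__":
    harness(b"definitely not zlib data")  # swallowed: documented failure
    harness(zlib.compress(b"hello"))      # no exception at all
```

Anything outside the allowed tuple propagates out of the harness, so the fuzz
engine sees it as a crash and keeps the input that triggered it.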

In the OSS-Fuzz PR I added a fuzzer that tests option (2) listed above:
https://github.com/google/oss-fuzz/pull/7681

You're right that the fuzzer will continue to explore the search space after
it runs into an issue. OSS-Fuzz, however, comes with a large backend that
manages the running of the fuzzers and de-duplicates findings, so that a bug
is only reported once even if the fuzzer hits it N times.
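
To give a feel for what de-duplication means here, below is a toy sketch that
groups crashes by a signature. I am not describing how the OSS-Fuzz backend
actually does it: the signature (exception type plus the innermost stack
frames) and all names in the code are assumptions for illustration.

```python
import traceback


def crash_signature(exc: BaseException, depth: int = 3) -> tuple:
    """Exception type plus (function, line) of the innermost frames."""
    frames = traceback.extract_tb(exc.__traceback__)[-depth:]
    return (type(exc).__name__,) + tuple((f.name, f.lineno) for f in frames)


def unique_crashes(target, inputs) -> dict:
    """Map each distinct signature to the first input that triggered it."""
    reports = {}
    for data in inputs:
        try:
            target(data)
        except Exception as exc:
            reports.setdefault(crash_signature(exc), data)
    return reports


def example_target(data: bytes):
    """Hypothetical target with two distinct bugs."""
    if data.startswith(b"A"):
        raise ValueError("bad header")
    if data.startswith(b"B"):
        return data[10]  # IndexError on short input
    return None


if __name__ == "__main__":
    found = unique_crashes(example_target, [b"A1", b"A2", b"B", b"A3", b"ok"])
    for signature, reproducer in found.items():
        print(signature[0], reproducer)
```

Three of the five inputs hit the same ValueError, yet only one report is kept
for it, alongside one report for the IndexError.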

Kind regards,
David

On 08/06/2022 21:46, Aaron Meurer wrote:
I know the hypothesis developers consider Hypothesis to be different from fuzzing. But I've never
been exactly clear just what is meant by "fuzzing" in the context you are suggesting.
When you say you want to "fuzz NumPy" what sorts of things would the fuzzer be doing?
Would you need to tell it what various NumPy functions and operations are and how to generate
inputs for them? Or does it do that automatically somehow? And how would you tell it what sorts of
things to check for a given set of inputs?

For a Hypothesis test, you would tell it explicitly what the input is, like "a is an array with
some given properties (e.g., >1 dim, has a numerical dtype, has positive values, etc.)". Then
you explicitly write a bunch of assertions that such arrays should satisfy (like some f(a).all()). It
then generates examples from the given set of inputs in an attempt to falsify the given assertions.
The whole process requires a considerable amount of human work because you have to figure out a bunch
of properties that various operations should satisfy on certain sets of inputs and write tests for
them. I'm still unclear on just what "fuzzing" is, but my impression has always been that
it's not this.

One difference I do know between Hypothesis and a fuzzer is that Hypothesis is
more geared toward finding test failures and getting you to fix them. So for
example, Hypothesis only runs 100 examples by default each run. You have to
manually increase that number to run more. Another difference is that if
Hypothesis finds a failure, it will fixate on that failure and always return
it, even to the detriment of finding other possible failures, until you either
fix it or modify the strategies to ignore it. My understanding is that a
fuzzer is more geared toward exploring a wide search space and finding as many
issues as possible, even if there isn't the immediate possibility of them
being fixed.

I've used Hypothesis on several projects that depend on NumPy and incidentally
found several bugs in NumPy with it (for example,
https://github.com/numpy/numpy/issues/15753).

Aaron Meurer

On Wed, Jun 8, 2022 at 8:44 AM david korczynski
<da...@adalogics.com> wrote:
I'm not 100% sure about the important differences, so this is a bit of an
intuitive analysis from my side (I know little about Hypothesis and more
about fuzzing).

Hypothesis has support for traditional fuzzing [sic]:
https://hypothesis.readthedocs.io/en/latest/details.html?highlight=fuzz#use-with-external-fuzzers
and OSS-Fuzz supports using Python fuzzing by way of Hypothesis
https://google.github.io/oss-fuzz/getting-started/new-project-guide/python-lang/#hypothesis
although it will be seeded with the Atheris fuzzer and based on this
issue https://github.com/google/atheris/issues/20 it seems Atheris +
Hypothesis might not be working particularly well together.

I think, based on the above and on skimming through the Hypothesis docs, that
there are many similarities between fuzzing (Atheris specifically) and
Hypothesis, but the underlying engine that explores the input space is
different. Fuzzing is coverage-guided (which I don't think Hypothesis is, but
I could be wrong), meaning the target program is instrumented to identify
if a newly generated input explores new code. In essence, this makes
fuzzing a mutational genetic algorithm. Another benefit is that OSS-Fuzz will
build the target code with various sanitizers (ASan, UBSan, MSan), which
will help highlight issues in the native code.

As for why it would be great to fuzz more Python code: this was more of a
general statement, in that a lot of effort is being put into this from the
OSS-Fuzz side because Python is a widely used language. For example, one
effort in this domain is the investigation of new bug oracles for Python
(like sanitizers, but targeted at memory-safe languages).

On 07/06/2022 15:10, Matti Picus wrote:


On 7/6/22 14:02, david korczynski wrote:

Hi Numpy maintainers,

Would you be interested in integrating continuous fuzzing by way of
OSS-Fuzz? Fuzzing is a way to automate test-case generation and has been
heavily used for memory unsafe languages. Recently efforts have been put
into fuzzing memory safe languages and Python is one of the languages
where it would be great to use fuzzing.

...

Let me know your thoughts on this and if you have any questions as I’m
happy to clarify or go more into details with fuzzing.

Kind regards,
David



Could you compare and contrast this to hypothesis [0], which we are
already using in our testing?

I don't understand what you mean by "Python is one of the languages
where it would be great to use fuzzing". Why?

Matti


[0] https://hypothesis.readthedocs.io/en/latest/index.html

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: da...@adalogics.com
ADA Logics Ltd is registered in England. No: 11624074.
Registered office: 266 Banbury Road, Post Box 292,
OX2 7DL, Oxford, Oxfordshire , United Kingdom

