(apologies if this is received repeatedly; I'm having some skill-related issues with the mailing list...)
Hello, all. Nice to see folks interested in this topic! I responded on bsky already: https://bsky.app/profile/pitrou.net/post/3mejc4q375c2d I wanted to also clarify here.

The most critical question I see here is one of input space projection. Think of it this way: all programs which consume structured data ultimately consume unstructured data (byte sequences). Programs consume this in different ways; you might consider this at different levels of the Chomsky hierarchy, e.g.:

1. Some parts of your program make sure that the data is composed of valid byte sequences (aka, lexically valid)
2. Some parts of your program make sure that the data is composed of valid structures (aka, syntactically valid)
3. Some parts of your program make sure that the data is composed of sound information (aka, semantically valid)
4. Some parts of your program then do stuff with that sound information.

Mutations from fuzzers like AFL++ work on byte sequences -- i.e., they emit and mutate unstructured data. From the previous perspective, what that means is that they mostly exercise the lexical part of the program. As syntactic or semantic complexity increases, you have less and less ability to test the higher segments of the program, because you are simply searching those parts less.

So, the next step might be to produce a fuzzer which only produces lexically valid data. You can do this with AFL++ with dictionaries! In fact, modern AFL++ compilers extract "tokens" from source code to build automatic dictionaries (autodict); most modern fuzzers now do this. (A small sketch of a dictionary file follows below.) What's interesting here is that AFL++ also retains the mutations which are lexically destructive (i.e., some mutations are still not aware of where tokens are and will corrupt those input regions). But that's actually advantageous, because it tests the code which checks for lexical validity.

The next step up would be to produce syntactically valid inputs. For this, you'll need to move up to other fuzzers; tools like grammarinator or Nautilus handle this: https://github.com/renatahodovan/grammarinator (note that these are a bit above context-free, but w/e). Just like with AFL++ fuzzing "just outside" of lexical validity, you want inputs that are "just outside" of syntactic validity. There's some more recent work on that, as well: https://doi.org/10.1145/3605157.3605170

Now, here's the rub: with unstructured and lexical mutations, you reason about the input linearly; you basically work with sequences in either case. But when we use syntactic fuzzers, we reason about the input in a tree-like format; you have some AST that you mutate rather than some sequence. What this means is that the concept of "distance" changes: two inputs which are structurally similar might be significantly different as unstructured byte sequences. This means that you _actually explore the program differently_; you are literally traversing the program's "behaviour space" (if you want to think about it so abstractly) in a different way. As a result, the way you progress through a program changes, and even if you reach the same coverage, the means by which you reached it is different, so you likely test the same code regions in different ways. See: https://doi.org/10.1145/3742894 (The Python sketch below illustrates the tree-versus-sequence point.)

Moreover, syntactic fuzzers (i.e., those which mutate on the AST) cannot easily be used in combination with lexical or unstructured fuzzers; the structured nature means that it is very difficult to ensure that a lexical mutation doesn't make the structure impossible to reason about.
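To make the dictionary step concrete: an AFL++ dictionary is just a flat file of named tokens, which the mutator splices into and around inputs. A minimal sketch (the token names and values are made up here for an XML-ish target, and ./target is a placeholder):

    # tokens.dict
    tag_open="<"
    tag_close=">"
    elem_note="note"
    attr_id="id=\""

    afl-fuzz -i seeds/ -o findings/ -x tokens.dict -- ./target @@

With autodict (the compiler-extracted dictionaries mentioned above) you get much of this for free, but hand-written entries can still be useful for tokens the compiler can't see.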
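And to illustrate the tree-versus-sequence point, here's a tiny self-contained Python sketch (this is not grammarinator or Nautilus; the toy grammar and the mutation operator are made up). It derives a tree from a grammar, applies a single subtree mutation, and lowers both trees back to bytes -- one tree edit can be an arbitrarily large byte-level edit:

    import random

    # A toy context-free grammar: nonterminal -> list of expansions.
    # The last expansion of each nonterminal is non-recursive.
    GRAMMAR = {
        "<expr>": [["<expr>", "+", "<term>"], ["<term>"]],
        "<term>": [["(", "<expr>", ")"], ["<num>"]],
        "<num>":  [["1"], ["2"], ["42"]],
    }

    def generate(symbol, depth=0):
        """Expand a symbol into a derivation tree: (symbol, children)."""
        if symbol not in GRAMMAR:
            return (symbol, [])  # terminal
        # Past a depth limit, force the last (non-recursive) expansion.
        options = GRAMMAR[symbol] if depth < 8 else GRAMMAR[symbol][-1:]
        return (symbol, [generate(s, depth + 1) for s in random.choice(options)])

    def serialize(tree):
        """Lower a derivation tree back into an unstructured string."""
        symbol, children = tree
        return symbol if not children else "".join(serialize(c) for c in children)

    def mutate(tree):
        """One tree-level mutation: regrow some random subtree."""
        symbol, children = tree
        inner = [i for i, (_, cs) in enumerate(children) if cs]
        if inner and random.random() < 0.5:
            children = list(children)
            i = random.choice(inner)
            children[i] = mutate(children[i])
            return (symbol, children)
        return generate(symbol)  # replace this whole subtree

    random.seed(1)  # fixed seed so the sketch is reproducible
    t = generate("<expr>")
    print(serialize(t), "->", serialize(mutate(t)))

A sequence-based mutator has no way to express "regrow this subtree"; it can only land on such an edit by accident. That is the change in "distance".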
When we move up to semantics, the same step happens again. Now, instead of structure, you might need to reason about the relationships between AST nodes -- oftentimes in ways disconnected from the syntax. Consider writing something like a semantic mutator for C programs. Instead of reasoning about the AST, you might prefer to reason about mutating typed dataflow graphs, which you then "lower" into an AST, which you then "lower" into lexical components, which you then "lower" into unstructured data. (See the P.S. at the end of this mail for a rough sketch.) For the same reasons as before, you want to produce inputs "just outside" of semantic validity, and the representation of inputs is not compatible with the other levels.

Different representations of program semantics matter. In the XML case, they changed from semantics of the _input_ to semantics of the _API_. This is significant because it changes, yet again, how mutations are applied. Even stuck with the admittedly very, very limited metric that is coverage, exploring this space with different "input projections" (as I've taken to calling it) is meaningful.

This, all together, is why neither one harness nor one fuzzer is enough. For starters, you need different harnesses to test different parts of the program. Then, you need different fuzzers to produce different kinds of data for each harness. Then, at a semantic level, you need different fuzzers to produce different representations of the semantics of the program under test.

If you want to get started with semantic fuzzing, IMO the best fuzzer is the one you write to accomplish your task. LibAFL or FANDANGO might help, but you'll very likely need a home-grown solution.

- https://github.com/AFLplusplus/LibAFL
- https://github.com/fandango-fuzzer/fandango

I've seen some evidence to suggest that fuzzers like FANDANGO (which is syntactic + constraints) might not be able to construct deeply semantically complex inputs vis-a-vis the dataflow example from before, so consider playing with it, but I really suggest making a higher-order input representation + mutations.

The most critical takeaway from my talk is one that you've already taken: you _need_ to think about how the fuzzer interacts with the program under test. It sounds like you folks have that covered, though! Do let me know if I can help clarify anything else, or point you to useful resources.

-- 
Addison Crump | PhD Candidate
CISPA – Helmholtz-Zentrum für Informationssicherheit gGmbH
Stuhlsatzenhaus 5, 66123 Saarbrücken, Deutschland
Room 2.17
p +49 681 87083 2876
e [email protected] | w www.cispa.de
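P.S. To make the "lowering" idea concrete, here is a very rough Python sketch (the operation set, the graph encoding, and the emitted C are all made up for illustration; this is not how LibAFL or FANDANGO do it). The mutation happens on a typed dataflow graph, where it is easy to keep inputs well-typed; only at the very end is the graph lowered into unstructured bytes:

    import random
    from dataclasses import dataclass, field

    # A toy typed dataflow graph for straight-line C programs.
    @dataclass
    class Node:
        op: str                                    # "const", "add", "addr", "deref"
        typ: str                                   # result type: "int" or "int*"
        args: list = field(default_factory=list)   # indices of producer nodes

    OPERAND_TYPES = {"add": ["int", "int"], "addr": ["int"], "deref": ["int*"]}

    def mutate(graph):
        """Semantic mutation: rewire one operand to another type-correct,
        earlier producer, so the result stays well-typed."""
        g = [Node(n.op, n.typ, list(n.args)) for n in graph]
        i = random.randrange(1, len(g))
        if g[i].args:
            slot = random.randrange(len(g[i].args))
            want = OPERAND_TYPES[g[i].op][slot]
            producers = [j for j in range(i) if g[j].typ == want]
            if producers:
                g[i].args[slot] = random.choice(producers)
        return g

    def lower(graph):
        """Lower: dataflow graph -> (implicitly) AST -> tokens -> bytes."""
        stmts = []
        for i, n in enumerate(graph):
            if n.op == "const":
                stmts.append(f"int v{i} = 42;")
            elif n.op == "add":
                stmts.append(f"int v{i} = v{n.args[0]} + v{n.args[1]};")
            elif n.op == "addr":
                stmts.append(f"int *v{i} = &v{n.args[0]};")
            elif n.op == "deref":
                stmts.append(f"int v{i} = *v{n.args[0]};")
        body = "\n  ".join(stmts + [f"return v{len(graph) - 1};"])
        return "int main(void) {\n  " + body + "\n}\n"

    seed = [Node("const", "int"), Node("const", "int"),
            Node("add", "int", [0, 1]), Node("addr", "int*", [2]),
            Node("deref", "int", [3])]
    print(lower(mutate(seed)))

Note that the mutator never sees bytes at all; producing inputs "just outside" semantic validity would mean occasionally breaking the type rule on purpose.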
