OK, this makes more sense. Something interesting here would be generating
malicious .parquet files which are technically well formed w.r.t. a schema,
yet capable of breaking a system in ways which affect services. I'm thinking
of systems which take .parquet files as part of their workflow, including
from untrusted sources, and which could suffer if legal-but-malicious
parquet files got served up.
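To make the "legal-but-malicious" idea concrete: a Parquet file both starts and ends with the magic bytes `PAR1`, and the four bytes just before the trailing magic encode the footer (metadata) length as a little-endian u32. A reader that trusts that declared length can be pushed into huge allocations or out-of-range seeks by a file whose trailer is perfectly well formed. A minimal stdlib-only sketch (the body here is junk bytes and `trailer_bomb` is an invented name, not a real tool):

```python
import struct

MAGIC = b"PAR1"

def trailer_bomb(declared_footer_len: int) -> bytes:
    """Build a blob with a well-formed Parquet trailer that declares an
    absurd footer length. Illustrative only -- the body is junk."""
    body = b"\x00" * 16  # stand-in for column chunks / footer bytes
    tail = struct.pack("<I", declared_footer_len) + MAGIC
    return MAGIC + body + tail

blob = trailer_bomb(2**31 - 1)

# What a naive reader does: seek to EOF-8, read the declared length...
declared = struct.unpack("<I", blob[-8:-4])[0]
# ...then tries to allocate/read ~2 GiB of "footer" from a 28-byte file.
```

The same pattern generalizes to the catalog scenario: every structural field a reader trusts (row group counts, page sizes, dictionary sizes) is a candidate for a legal-but-hostile value.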
I am thinking about shared public parquet datasets served through Iceberg
and the like. If someone serves data through a catalog, could they return
malicious data to some of the clients? Here the fuzzer would be generating
valid files, but with things like the row group size set to its minimum (or
up to terabytes), encodings which trigger OOM issues, etc.

On Wed, 11 Feb 2026 at 16:44, Addison Crump <[email protected]> wrote:
> (apologies if this is received repeatedly; I'm having some
> skill-related issues with the mailing list...)
>
> Hello, all. Nice to see folks interested in this topic!
>
> I responded on bsky already:
> https://bsky.app/profile/pitrou.net/post/3mejc4q375c2d
>
> I wanted to also clarify here. The most critical question from what I
> see here is one of input space projection.
>
> Think of it this way: all programs which consume structured data
> ultimately consume unstructured data. Programs consume this in
> different ways; you might consider this at different levels of the
> Chomsky hierarchy, e.g.:
>
> 1. Some parts of your program make sure that the data is composed of
>    valid byte sequences (aka, lexically valid)
> 2. Some parts of your program make sure that the data is composed of
>    valid structures (aka, syntactically valid)
> 3. Some parts of your program make sure that the data is composed of
>    sound information (aka, semantically valid)
> 4. Some parts of your program then do stuff with that sound
>    information.
>
> Mutations from fuzzers like AFL++ work on byte sequences -- i.e.,
> they emit and mutate unstructured data. From the previous
> perspective, what that means is that they mostly exercise the lexical
> part of the program. As syntactic or semantic complexity increases,
> you have less and less ability to test the higher segments of the
> program, because you are simply searching them less.
>
> So, the next step might be to produce a fuzzer which only produces
> lexically valid data. You can do this with AFL++ with dictionaries!
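For the Parquet case, such a dictionary might look like the following (the magic value is the real format constant; the file name and the other entries are illustrative placeholders):

```
# illustrative parquet.dict, passed to afl-fuzz via -x parquet.dict
magic="PAR1"
# Further raw-byte tokens (Thrift compact-protocol field markers,
# codec/encoding enum values, etc.) would be added the same way:
stop_field="\x00"
```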
> In fact, modern AFL++ compilers extract "tokens" from source code to
> build automatic dictionaries (autodict). Most modern fuzzers now do
> this.
>
> What's interesting here is that AFL++ also retains the mutations
> which are lexically destructive (i.e., some mutations are still not
> aware of where tokens are and will corrupt those input regions). But
> that's actually advantageous, because it will test the code which
> checks for lexical validity.
>
> The next step up would be to produce syntactically valid inputs. For
> this, you'll need to move up to other fuzzers; tools like
> grammarinator or Nautilus handle this:
> https://github.com/renatahodovan/grammarinator
> (note that these are a bit above context-free, but w/e)
>
> Just like with AFL++ fuzzing "just outside" of lexical validity, you
> want inputs that are "just outside" of syntactic validity. There's
> some more recent work on that, as well:
> https://doi.org/10.1145/3605157.3605170
>
> Now, here's the rub: with unstructured and lexical mutations, you
> reason about the input linearly: you basically work with sequences in
> either case. But when we use syntactic fuzzers, we reason about the
> input in a tree-like format; you have some AST that you mutate rather
> than some sequence. What this means is that the concept of "distance"
> changes; two inputs which are structurally similar might be
> significantly different as raw bytes. This means that you _actually
> explore the program differently_; you are literally traversing the
> program's "behaviour space" (if you want to think about it so
> abstractly) in a different way. As a result, the way you progress
> through a program changes, and even if you reach the same coverage,
> the means by which you reached it are different. As a result, you
> likely test the same code regions in different ways.
> See: https://doi.org/10.1145/3742894
>
> Moreover, syntactic fuzzers (i.e., those which mutate on the AST)
> cannot easily be used in combination with lexical or unstructured
> fuzzers; the structured nature means that it is very difficult to
> ensure that a lexical mutation doesn't make the structure impossible
> to reason about.
>
> When we move up to semantics, the same step happens again. Now,
> instead of structure, you might need to reason about the
> relationships between AST nodes -- oftentimes in ways disconnected
> from the syntax. Consider writing something like a semantic mutator
> for C programs. Instead of reasoning about the AST, you might prefer
> to reason about mutating typed dataflow graphs, which you then
> "lower" into an AST, which you then "lower" into lexical components,
> which you then "lower" into unstructured data. For the same reasons
> as before, you want to produce inputs "just outside" of semantic
> validity, and the representation of inputs is not compatible with
> other levels.
>
> Different representations of program semantics matter. In the XML
> case, they changed from semantics of the _input_ to semantics of the
> _API_. This is significant, because it changes how mutations are
> applied again. Even stuck with the admittedly very, very limited
> metric that is coverage, exploring this space with different "input
> projections" (as I've taken to calling it) is meaningful.
>
> This, all together, is why neither one harness nor one fuzzer is
> enough. For starters, you need different harnesses to test different
> parts of the program. Then, you need different fuzzers to produce
> different kinds of data for each harness. Then, at a semantic level,
> you need different fuzzers to produce different representations of
> the semantics of the program under test.
>
> If you want to get started with semantic fuzzing, IMO the best
> fuzzer is the one you write to accomplish your task.
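A toy sketch of that layered idea, with a mutation applied at the typed-dataflow level and then lowered to text (all names here are invented for illustration; a real harness would target your actual format):

```python
import random

# A toy "typed dataflow" node: (op, lhs, rhs). Mutations happen here,
# where type constraints are visible, not on the rendered text.
INT_OPS = ["+", "-", "*"]

def mutate(expr, rng):
    """Semantically-aware mutation: swap the operator for another with
    the same type signature, so the program stays well-typed."""
    op, lhs, rhs = expr
    return (rng.choice([o for o in INT_OPS if o != op]), lhs, rhs)

def lower(expr):
    """Lower the dataflow node to its lexical form (source text)."""
    op, lhs, rhs = expr
    return f"({lhs} {op} {rhs})"

rng = random.Random(0)
seed = ("+", 1, 2)
mutant = mutate(seed, rng)
text = lower(mutant)  # always parses; always type-checks
```

Notice the "distance" point from above: `seed` and `mutant` are one tree edit apart, while a byte-level mutator would almost never produce this exact change without destroying syntactic validity first.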
> LibAFL or FANDANGO might help, but you'll very likely need a
> home-grown solution.
> - https://github.com/AFLplusplus/LibAFL
> - https://github.com/fandango-fuzzer/fandango
>
> I've seen some evidence to suggest that fuzzers like FANDANGO (which
> is syntactic + constraints) might not be able to construct deeply
> semantically complex inputs vis-a-vis the dataflow example from
> before, so consider playing with it, but I really suggest building a
> higher-order input representation + mutations.
>
> The most critical takeaway from my talk is one that you've already
> taken: you _need_ to think about how the fuzzer interacts with the
> program under test. It sounds like you folks have that covered,
> though!
>
> Do let me know if I can help clarify anything else, or point you to
> useful resources.
>
> --
> Addison Crump | PhD Candidate
> CISPA – Helmholtz-Zentrum für Informationssicherheit gGmbH
> Stuhlsatzenhaus 5, 66123 Saarbrücken, Deutschland
> Room 2.17
> p +49 681 87083 2876
> e [email protected] | w www.cispa.de
