Re: Fuzzing Parquet C++

Antoine Pitrou Tue, 03 Mar 2026 08:07:36 -0800


Hello all,

On 15/12/2025 at 09:59, Antoine Pitrou wrote:

I am also toying with the idea of an encoding/decoding fuzzer that
roundtrips data (see "function/inverse pairs" in
https://blog.regehr.org/archives/856). The question becomes in which
format the fuzzer would accept input data for the encoding step (as
Parquet files, which would mean a decoding/encoding/decoding roundtrip?
as Arrow IPC files, which are a simpler format?).


As a heads up, we have now merged a 3-way encoding roundtrip fuzzer:
https://github.com/apache/arrow/pull/49374

Quick copy-paste from the PR description:
"""
Add a fuzz target that does a 3-way encoding roundtrip:

1. decodes its payload using a source encoding
2. re-encodes the values using a roundtrip encoding
3. re-decodes using the same roundtrip encoding, and checks that the
   values read are identical

The motivation is to ensure that encoding and decoding can roundtrip for
any valid data, while the full-file Parquet fuzzer only tests reading
the data. Furthermore, we postulate that a smaller workload that runs on
a smaller search space like this might lead to a faster exploration of
the search space (but we don't know for sure).

The reason for the two encodings (source and roundtrip) is that not all
encodings imply the same input structure (grammar, etc.) and using
another source encoding (especially PLAIN, which has very little
structure) might allow the fuzzer to explore the logical search space
(i.e. values) in different ways.

"""
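
To illustrate the three steps above, here is a minimal Python sketch. It
is not the code from the PR (the actual fuzz target is C++ and lives in
the Arrow repository). The two codecs, the InvalidPayload exception and
the fuzz_one() function are hypothetical names made up for this example,
and the toy "plain" and "rle" formats do not follow the Parquet
encodings byte for byte.

```python
import struct


class InvalidPayload(Exception):
    """Raised when a payload is not valid for the chosen source encoding."""


# Toy "plain" codec (assumption): consecutive little-endian int32 values.
def plain_encode(values):
    return b"".join(struct.pack("<i", v) for v in values)


def plain_decode(data):
    if len(data) % 4 != 0:
        raise InvalidPayload("truncated int32 value")
    return [v for (v,) in struct.iter_unpack("<i", data)]


# Toy "rle" codec (assumption): repeated (run length: uint8, value: int32).
def rle_encode(values):
    out = bytearray()
    i = 0
    while i < len(values):
        run = 1
        while (i + run < len(values) and run < 255
               and values[i + run] == values[i]):
            run += 1
        out += struct.pack("<Bi", run, values[i])
        i += run
    return bytes(out)


def rle_decode(data):
    if len(data) % 5 != 0:
        raise InvalidPayload("truncated run")
    values = []
    for run, value in struct.iter_unpack("<Bi", data):
        if run == 0:
            raise InvalidPayload("zero-length run")
        values.extend([value] * run)
    return values


CODECS = {
    "plain": (plain_encode, plain_decode),
    "rle": (rle_encode, rle_decode),
}


def fuzz_one(payload, source, roundtrip):
    """Run one 3-way roundtrip; return False if the payload is rejected."""
    # 1. Decode the fuzzer-provided payload using the source encoding.
    try:
        values = CODECS[source][1](payload)
    except InvalidPayload:
        return False  # invalid input is not a bug, just uninteresting
    encode, decode = CODECS[roundtrip]
    # 2. Re-encode the values using the roundtrip encoding.
    encoded = encode(values)
    # 3. Re-decode and check that the values read are identical.
    assert decode(encoded) == values, "roundtrip mismatch"
    return True
```

With "plain" as the source encoding, almost any payload whose length is
a multiple of 4 is valid, so the fuzzer mostly explores the values
themselves. With "rle" as the source, it also has to get the run
structure right.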

Regards

Antoine.
