Hello all,
On 15/12/2025 at 09:59, Antoine Pitrou wrote:
> I am also toying with the idea of an encoding/decoding fuzzer that
> roundtrips data (see "function/inverse pairs" in
> https://blog.regehr.org/archives/856). The question becomes in which
> format the fuzzer would accept input data for the encoding step (as
> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> as Arrow IPC files, which are a simpler format?).
As a heads up, we have now merged a 3-way encoding roundtrip fuzzer:
https://github.com/apache/arrow/pull/49374
Quick copy-paste from the PR description:
"""
Add a fuzz target that does a 3-way encoding roundtrip:
1. decodes its payload using a source encoding
2. re-encodes the values using a roundtrip encoding
3. re-decodes using the same roundtrip encoding, and checks that the
values read are identical
The motivation is to ensure that encoding and decoding can roundtrip for
any valid data, while the full-file Parquet fuzzer only tests reading
the data. Furthermore, we postulate that a smaller workload that runs on
a smaller search space like this might lead to a faster exploration of
the search space (but we don't know for sure).
The reason for the two encodings (source and roundtrip) is that not all
encodings imply the same input structure (grammar, etc.) and using
another source encoding (especially PLAIN, which has very little
structure) might allow the fuzzer to explore the logical search space
(i.e. values) in different ways.
"""
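To give a feel for the shape of such a target, here is a minimal Python sketch of the three steps described above. It is not the code from the PR: the two codecs (a "plain" int32 layout and a toy run-length encoding), the CODECS table and the fuzz_one function are all hypothetical names invented for illustration, and they do not implement the real Parquet encodings.

```python
import struct

# Hypothetical toy codecs for illustration only; these are NOT the Parquet
# encodings and NOT the API used by the actual fuzz target.

def plain_encode(values):
    """Encode ints as consecutive little-endian int32 values."""
    return struct.pack(f"<{len(values)}i", *values)

def plain_decode(payload):
    """Decode little-endian int32 values, rejecting truncated payloads."""
    if len(payload) % 4 != 0:
        raise ValueError("truncated plain payload")
    return [v for (v,) in struct.iter_unpack("<i", payload)]

def rle_encode(values):
    """Toy run-length encoding: (run length as uint8, value as int32) pairs."""
    out = bytearray()
    i = 0
    while i < len(values):
        run = 1
        while i + run < len(values) and values[i + run] == values[i] and run < 255:
            run += 1
        out += struct.pack("<Bi", run, values[i])
        i += run
    return bytes(out)

def rle_decode(payload):
    """Decode the toy run-length encoding, rejecting malformed payloads."""
    if len(payload) % 5 != 0:
        raise ValueError("truncated RLE payload")
    values = []
    for run, value in struct.iter_unpack("<Bi", payload):
        if run == 0:
            raise ValueError("zero-length run")
        values.extend([value] * run)
    return values

CODECS = {
    "plain": (plain_encode, plain_decode),
    "rle": (rle_encode, rle_decode),
}

def fuzz_one(payload, source="plain", roundtrip="rle"):
    """Run one 3-way roundtrip; return False if the payload is invalid."""
    _, source_decode = CODECS[source]
    try:
        # Step 1: decode the fuzzer payload using the source encoding.
        values = source_decode(payload)
    except ValueError:
        return False  # invalid input is expected and is not a bug
    encode, decode = CODECS[roundtrip]
    # Step 2: re-encode the values using the roundtrip encoding.
    encoded = encode(values)
    # Step 3: re-decode and check that the values read are identical.
    assert decode(encoded) == values, "roundtrip mismatch"
    return True
```

In this sketch, choosing "plain" as the source means almost any payload whose length is a multiple of 4 decodes to some list of values, which mirrors the remark about PLAIN having very little structure. Swapping the two arguments, for example fuzz_one(data, "rle", "plain"), exercises the same values through a more constrained input grammar.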
Regards
Antoine.