Hello all,

On 15/12/2025 at 09:59, Antoine Pitrou wrote:
I am also toying with the idea of an encoding/decoding fuzzer that
roundtrips data (see "function/inverse pairs" in
https://blog.regehr.org/archives/856). The question becomes in which
format the fuzzer would accept input data for the encoding step (as
Parquet files, which would mean a decoding/encoding/decoding roundtrip?
as Arrow IPC files, which are a simpler format?).

As a heads up, we have now merged a 3-way encoding roundtrip fuzzer:
https://github.com/apache/arrow/pull/49374

Quick copy-paste from the PR description:
"""
Add a fuzz target that does a 3-way encoding roundtrip:

1. decodes its payload using a source encoding
2. re-encodes the values using a roundtrip encoding
3. re-decodes using the same roundtrip encoding, and checks that the values read are identical

The motivation is to ensure that encoding and decoding can roundtrip for any valid data, while the full-file Parquet fuzzer only tests reading the data. Furthermore, we postulate that a smaller workload that runs on a smaller search space like this might lead to a faster exploration of the search space (but we don't know for sure).

The reason for the two encodings (source and roundtrip) is that not all encodings imply the same input structure (grammar, etc.) and using another source encoding (especially PLAIN, which has very little structure) might allow the fuzzer to explore the logical search space (i.e. values) in different ways.
"""
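
For readers who want to see the shape of the three steps, here is a minimal Python sketch. It is not the code from the PR: the two toy codecs (a PLAIN-like int32 layout and a byte-oriented run-length scheme) and every name in it (plain_decode, rle_encode, fuzz_one, CODECS) are hypothetical stand-ins chosen for illustration, and the real fuzz target works against the actual Parquet encodings in C++.

```python
import struct

# Hypothetical toy codecs for illustration only; these are NOT the
# Parquet encodings nor the implementation from the PR.

def plain_encode(values):
    """Pack values as consecutive little-endian int32."""
    return struct.pack(f"<{len(values)}i", *values)

def plain_decode(payload):
    """Unpack consecutive little-endian int32; reject truncated input."""
    if len(payload) % 4 != 0:
        raise ValueError("truncated PLAIN payload")
    return [v for (v,) in struct.iter_unpack("<i", payload)]

def rle_encode(values):
    """Encode as (run_length: uint8, value: int32) pairs."""
    out = bytearray()
    i = 0
    while i < len(values):
        run = 1
        while (i + run < len(values) and run < 255
               and values[i + run] == values[i]):
            run += 1
        out += struct.pack("<Bi", run, values[i])
        i += run
    return bytes(out)

def rle_decode(payload):
    """Decode (run_length, value) pairs; reject malformed input."""
    if len(payload) % 5 != 0:
        raise ValueError("truncated RLE payload")
    values = []
    for run, value in struct.iter_unpack("<Bi", payload):
        if run == 0:
            raise ValueError("empty run")
        values.extend([value] * run)
    return values

CODECS = {
    "plain": (plain_encode, plain_decode),
    "rle": (rle_encode, rle_decode),
}

def fuzz_one(payload, source, roundtrip):
    """Run one 3-way roundtrip; return False if the payload is rejected."""
    _, decode_source = CODECS[source]
    encode_rt, decode_rt = CODECS[roundtrip]
    try:
        values = decode_source(payload)   # 1. decode with the source encoding
    except ValueError:
        return False                      # invalid input is not a failure
    encoded = encode_rt(values)           # 2. re-encode with the roundtrip encoding
    decoded = decode_rt(encoded)          # 3. re-decode and compare
    assert decoded == values, "roundtrip mismatch"
    return True
```

In this sketch, a payload that the source decoder rejects is simply skipped, so only a mismatch in steps 2 and 3 (or an unexpected exception) counts as a finding. Swapping the source and roundtrip arguments shows why two encodings are useful: the same bytes decode to very different values under "plain" and under "rle".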

Regards

Antoine.

