Re: [I] Fuzz tests for Arrow/Parquet [arrow-rs]

via GitHub Sat, 02 May 2026 21:39:35 -0700


youichi-uda commented on issue #5332:
URL: https://github.com/apache/arrow-rs/issues/5332#issuecomment-4365402831


   I'd like to take this on, building on the direction agreed in this thread 
(cargo-fuzz + libFuzzer; panics treated as bugs to fix; memory-blowup / 
infinite-loop as the higher-severity targets).
   
   **Concrete plan**
   
   1. Add `fuzz/` directories under the most security-relevant parsers, 
prioritized by past trouble area: `parquet/fuzz/`, `arrow-ipc/fuzz/`, then 
`arrow-json/`, `arrow-csv/`, `arrow-avro/`. One target per entry point (e.g. 
`parquet_arrow_reader`, `parquet_thrift_decode`, `ipc_stream_reader`).
   2. Seed corpora from `apache/arrow-testing` (`parquet/fuzzing/`, 
`arrow-ipc-stream/`, `arrow-ipc-file/`, `csv/fuzzing/`), plus minimal 
hand-crafted dictionaries.
   3. Wire **ClusterFuzzLite** in GitHub Actions for short PR-time fuzz + 
nightly batch + corpus pruning, so regressions get caught before merge.
   4. Open the **OSS-Fuzz** integration PR (`projects/arrow-rs/`) once (1)–(3) 
land, mirroring the layout of existing Rust workspace integrations like 
`gitoxide`. Coverage target ≥80% on harness-touched modules per the Integration 
Rewards bar.
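
   As a rough illustration of the harness shape in step 1, here is a minimal sketch. `parse_frame` is a hypothetical stand-in for a real entry point such as the parquet or IPC readers, and `fuzz_one` is an illustrative name; neither exists in arrow-rs. In an actual cargo-fuzz target the body of `fuzz_one` would sit inside `libfuzzer_sys::fuzz_target!(|data: &[u8]| { ... });`.

```rust
/// Hypothetical parser standing in for a real entry point (not arrow-rs code).
/// Wire format assumed for illustration: one length byte, then the payload.
fn parse_frame(data: &[u8]) -> Result<Vec<u8>, String> {
    // Empty input is a normal decode error, not a panic.
    let (&len, body) = data.split_first().ok_or("empty input")?;
    let len = len as usize;
    // `get` returns None instead of panicking when the declared length
    // runs past the end of the input.
    let payload = body.get(..len).ok_or("declared length exceeds input")?;
    Ok(payload.to_vec())
}

/// Body of a fuzz target: feed arbitrary bytes to the parser.
/// `Err` results are expected for malformed input and are discarded;
/// only panics, hangs, and runaway allocations count as findings.
fn fuzz_one(data: &[u8]) {
    let _ = parse_frame(data);
}
```

   The design choice worth noting is that the harness asserts nothing about the parse result; it relies on the fuzzer to flag crashes and resource blowups.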
   
   **What this would catch — concretely**
   
   I've been hitting the exact memory-blowup class of bug this issue calls out:
   
   - #9868 — parquet thrift `read_thrift_vec` allocating `Vec::with_capacity` 
from an attacker-controlled list length without bounding by remaining input → 
DoS via tiny crafted file.
   - #9869 — arrow-ipc `MessageReader` sizing allocations from a header field 
rather than actual stream bytes.
   - #9874 tracks the broader parquet thrift parser hardening surface.
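
   The allocation pattern behind the first two items can be sketched in isolation. This is a simplified illustration under stated assumptions, not the actual parquet or arrow-ipc code: `read_list_bounded` and `MIN_ELEMENT_SIZE` are hypothetical names, and each element is assumed to occupy exactly one byte on the wire.

```rust
/// Hypothetical list decoder (not arrow-rs code). `declared_len` is the
/// untrusted element count read from a header; `remaining` is the input
/// that is actually left to decode.
fn read_list_bounded(declared_len: usize, remaining: &[u8]) -> Result<Vec<u8>, String> {
    // Assumption for this sketch: every element needs at least one byte.
    const MIN_ELEMENT_SIZE: usize = 1;

    // Upper bound on how many elements the remaining input could hold.
    let max_elements = remaining.len() / MIN_ELEMENT_SIZE;
    if declared_len > max_elements {
        // Reject before allocating, so a tiny file cannot request a huge Vec.
        return Err(format!(
            "declared length {declared_len} exceeds the {max_elements} elements the input can hold"
        ));
    }

    // The capacity is now bounded by the real input size.
    let mut out = Vec::with_capacity(declared_len);
    out.extend_from_slice(&remaining[..declared_len]);
    Ok(out)
}
```

   Without the bound check, calling `Vec::with_capacity(declared_len)` directly on a header-supplied value is the kind of path a fuzzer tends to reach quickly.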
   
   #9868 and #9869 both reproduce in seconds under cargo-fuzz with a small seed 
corpus, so a harness pays for itself immediately instead of relying on long 
ClusterFuzz runs to find anything.
   
   **Two questions before I start writing harnesses** (cc @alamb @tustvold 
@crepererum @emkornfield):
   
   1. **Layout preference**: per-crate `fuzz/` dirs (gitoxide-style) vs. a 
single top-level `fuzz/` workspace? I'd lean per-crate — keeps `parquet`'s fuzz 
target buildable without pulling in the IPC tree, matches how each crate ships 
to crates.io, and lets the OSS-Fuzz `build.sh` use the standard `find . -type d 
-name fuzz` pattern.
   2. **`primary_contact` / `auto_ccs` for OSS-Fuzz `project.yaml`**: OSS-Fuzz 
wants Google-account emails on file. Who from the PMC would want crash 
notifications? Happy to default to a maintainer-monitored alias if there's a 
preference; otherwise I can list myself as `primary_contact` and add whoever 
opts in here to `auto_ccs`.
   
   Happy to adjust scope / sequencing — flagging here first per @emkornfield's 
original "step 2 = oss-fuzz" framing rather than just pushing a PR cold. xref 
#9358.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
