youichi-uda commented on issue #5332: URL: https://github.com/apache/arrow-rs/issues/5332#issuecomment-4365402831
I'd like to take this on, building on the direction agreed in this thread (cargo-fuzz + libFuzzer; panics treated as bugs to fix; memory-blowup / infinite-loop as the higher-severity targets).

**Concrete plan**

1. Add `fuzz/` directories under the most security-relevant parsers, prioritized by past trouble area: `parquet/fuzz/`, `arrow-ipc/fuzz/`, then `arrow-json/`, `arrow-csv/`, `arrow-avro/`. One target per entry point (e.g. `parquet_arrow_reader`, `parquet_thrift_decode`, `ipc_stream_reader`).
2. Seed corpora from `apache/arrow-testing` (`parquet/fuzzing/`, `arrow-ipc-stream/`, `arrow-ipc-file/`, `csv/fuzzing/`), plus minimal hand-crafted dictionaries.
3. Wire **ClusterFuzzLite** in GitHub Actions for short PR-time fuzz + nightly batch + corpus pruning, so regressions get caught before merge.
4. Open the **OSS-Fuzz** integration PR (`projects/arrow-rs/`) once (1)–(3) land, mirroring the layout of existing Rust workspace integrations like `gitoxide`. Coverage target ≥80% on harness-touched modules per the Integration Rewards bar.

**What this would catch, concretely**

I've been hitting the exact memory-blowup class of bug this issue calls out:

- #9868: parquet thrift `read_thrift_vec` allocating `Vec::with_capacity` from an attacker-controlled list length without bounding by remaining input → DoS via tiny crafted file.
- #9869: arrow-ipc `MessageReader` sizing allocations from a header field rather than actual stream bytes.
- #9874 tracks the broader parquet thrift parser hardening surface.

The first two reproduce in seconds under cargo-fuzz with a small seed corpus, so a harness pays for itself immediately rather than only finding things in long ClusterFuzz runs.

**Two questions before I start writing harnesses** (cc @alamb @tustvold @crepererum @emkornfield):

1. **Layout preference**: per-crate `fuzz/` dirs (gitoxide-style) vs. a single top-level `fuzz/` workspace? I'd lean per-crate: it keeps `parquet`'s fuzz target buildable without pulling in the IPC tree, matches how each crate ships to crates.io, and lets the OSS-Fuzz `build.sh` use the standard `find . -type d -name fuzz` pattern.
2. **`primary_contact` / `auto_ccs` for OSS-Fuzz `project.yaml`**: OSS-Fuzz wants Google-account emails on file. Who from the PMC would want crash notifications? Happy to default to a maintainer-monitored alias if there's a preference; otherwise I can list myself as `primary_contact` and add whoever opts in here to `auto_ccs`.

Happy to adjust scope / sequencing. I'm flagging this here first, per @emkornfield's original "step 2 = oss-fuzz" framing, rather than just pushing a PR cold. xref #9358.
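
As a rough illustration of the allocation pattern described for #9868 (sizing a `Vec` from a declared length without checking it against the remaining input), here is a minimal std-only sketch. Everything in it is hypothetical: the wire format (a 4-byte little-endian count followed by 1-byte elements), `read_bounded_vec`, `DecodeError`, and `fuzz_one` are invented for illustration and are not the parquet crate's actual thrift decoder or an existing fuzz target.

```rust
// Hypothetical illustration only: this is NOT the parquet crate's
// `read_thrift_vec`. Assumed wire format: a 4-byte little-endian element
// count, followed by that many 1-byte elements.

/// Errors the sketch decoder can report.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    /// Fewer than four bytes were available for the length prefix.
    TruncatedHeader,
    /// The declared count exceeds the bytes that remain in the input.
    LengthExceedsInput { declared: usize, remaining: usize },
}

/// Decodes a length-prefixed list, validating the declared count against
/// the remaining input before allocating anything.
pub fn read_bounded_vec(input: &[u8]) -> Result<Vec<u8>, DecodeError> {
    if input.len() < 4 {
        return Err(DecodeError::TruncatedHeader);
    }
    let (header, body) = input.split_at(4);
    let declared =
        u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;

    // Each element occupies at least one byte, so a count larger than the
    // remaining input can never be satisfied. Reject it instead of
    // reserving memory for it.
    if declared > body.len() {
        return Err(DecodeError::LengthExceedsInput {
            declared,
            remaining: body.len(),
        });
    }

    let mut out = Vec::with_capacity(declared);
    out.extend_from_slice(&body[..declared]);
    Ok(out)
}

/// Stand-in for the body of a fuzz target: it should return without
/// panicking or over-allocating for any input.
pub fn fuzz_one(data: &[u8]) {
    if let Ok(values) = read_bounded_vec(data) {
        // Invariant: a decoded list can never be larger than its input.
        assert!(values.len() <= data.len());
    }
}
```

In an actual cargo-fuzz target, a function like `fuzz_one` would presumably be the body passed to `libfuzzer_sys::fuzz_target!`; that wiring is left out so the sketch builds with the standard library alone. A five-byte input declaring `0xFFFFFFFF` elements is rejected with `LengthExceedsInput` instead of triggering a multi-gigabyte reservation.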
