youichi-uda commented on issue #21998:
URL: https://github.com/apache/datafusion/issues/21998#issuecomment-4367772092

   Thanks @2010YOUY01 — that's a fair read of where byte-level fuzzing pays off 
and where it doesn't, and I agree with most of it. A couple of clarifications 
and a smaller proposal.
   
   **On the existing `apache/datafusion-sqlparser-rs/fuzz`**
   
   I missed that — sorry. It uses honggfuzz with a single `fuzz_parse_sql` 
target that round-robins across all 14 dialects, which is a perfectly 
reasonable design for the parser and is more comprehensive than the `parse_sql` 
harness I'd prototyped. I don't think layering a second cargo-fuzz target on 
top of that crate buys much.
   
   **On byte-level fuzzing for SQL specifically**
   
   Your read matches mine. For comparison — the binary-format harnesses I've 
been prototyping in parallel (apache/arrow-rs#5332 / pola-rs/polars#27488) 
found 6 distinct DoS-class bugs over the past two days (fix PRs: 
arrow-rs#9883/9884/9886/9887, polars#27489/27490). For the SQL parser side, a 
5-minute libFuzzer run on `DFParser::parse_sql` reached 4,377 edges and 20,378 
features but produced no crashes — entirely consistent with what you'd expect 
from a parser that's been under property-based + honggfuzz coverage already.
   
   So the leverage isn't comparable, and I don't think DataFusion needs the 
kind of harness-density push the binary-parser side does. Withdrawing the broad 
proposal.
   
   **Smaller scope I'd actually suggest**
   
   Two narrower things, both fine to decline:
   
   1. **Bring `apache/datafusion-sqlparser-rs/fuzz` under OSS-Fuzz.** The 
honggfuzz harness already exists; OSS-Fuzz would just give it persistent 
corpus, 24/7 compute, and an automated crash-notification path. ~zero net work 
in this repo or in `datafusion-sqlparser-rs` itself; net change is a new 
`projects/datafusion-sqlparser-rs/` directory in google/oss-fuzz that drives 
the existing harness. Happy to send that as a standalone proposal in 
`datafusion-sqlparser-rs` if it's useful.
   
   2. **Corpus-mutation harness only.** As you suggested, a small target that 
takes a short byte slice and uses it to mutate the existing test SQL corpus 
already in `datafusion/sql/tests`, then re-parses. Stays in this repo, tiny 
surface, no big harness fleet. Sketch:
   
      ```rust
      fuzz_target!(|data: &[u8]| {
          let base = TEST_SQL_CORPUS[(data.first().copied().unwrap_or(0) as 
usize) % TEST_SQL_CORPUS.len()];
          let s = mutate_corpus_string(base, &data[1..]);  // small bit-flip 
mutator
          let _ = DFParser::parse_sql(&s);
      });
      ```
   
      This addresses your "valid SQL or close to it" point — every input is a 
near-valid query rather than random bytes.
   
   Either or neither — happy to follow whatever suits the project.
   
   **On how OSS-Fuzz works**
   
   Good question. Short version:
   
   - It's basically Google-hosted ClusterFuzz, free for OSS projects.
   - You ship a `projects/<name>/` directory in `google/oss-fuzz` with a 
`project.yaml` (contacts, sanitizers, fuzzing engines), a `Dockerfile` that 
fetches the source, and a `build.sh` that builds the harnesses into `$OUT/`.
   - Their infra rebuilds your harnesses daily, runs them on a fleet 24/7 with 
persistent corpus, and reports new crashes to the email addresses in 
`project.yaml` (`primary_contact` + `auto_ccs`). Crashes get auto-bisected, 
deduped, and filed as private issues that flip public after the standard 90-day 
disclosure window.
   - Local workflow is `python infra/helper.py {build_image, build_fuzzers, 
check_build, run_fuzzer, coverage} <project>`, which mirrors what their infra 
does so you can iterate on the harness before pushing.
   - Apache projects that already use it for comparison: `projects/arrow` 
(Arrow C++), `projects/librdkafka`, `projects/parquet-mr`, etc. Same shape, 
just pointed at a Rust crate instead.
   
   The closest precedent for the Rust workspace shape we'd want here is 
`projects/gitoxide` (auto-discovers per-crate `fuzz/` dirs in a workspace and 
builds each) or `projects/fontations`. Both are referenced from my prototype 
`projects/datafusion/` if you want to look — though as above, I'd actually 
recommend pointing it at `datafusion-sqlparser-rs` instead and leaving 
DataFusion proper alone.
   
   ---
   
   Net: I'd close this issue or downscope it to "OSS-Fuzz integration of the 
existing `datafusion-sqlparser-rs` honggfuzz harness, tracked there." Whichever 
you prefer. cc @alamb @andygrove @comphead @ozankabak in case anyone has a 
different read.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to