Re: [I] Fuzz tests for Arrow/Parquet [arrow-rs]

via GitHub Wed, 06 Nov 2024 01:30:55 -0800


crepererum commented on issue #5332:
URL: https://github.com/apache/arrow-rs/issues/5332#issuecomment-2459115580


   > We've had good experience in parquet-cpp of integrating with 
[google/oss-fuzz](https://github.com/google/oss-fuzz) I haven't looked into 
what supporting Rust would look like exactly but that seems like a reasonable 
place to start?
   
   Before we integrate anything into oss-fuzz, we need to write the fuzz 
harness. We also need to fix the bugs that occur when you run this for an hour 
or so on your developer machine (which can be a laptop), just to make sure we 
sure that we don't end up with SPAM when handing to to oss-fuzz.
   
   Have a look at https://rust-fuzz.github.io/book/introduction.html . Despite 
what I said in 
https://github.com/apache/arrow-rs/issues/5332#issuecomment-1910035219 , I 
think combining that with `libFuzzer` is still the best choice[^libafl]. It 
results in a reasonably efficient fuzzer with a somewhat low barrier of 
entrance. You may want to combine this with 
[`arbitrary`](https://github.com/rust-fuzz/arbitrary) if you need more than 
just a few bytes, e.g. you also want to fuzz parameters or something.
   
   > I think one important item is a philosophical question, of if malformed 
data should trigger a panic. Can I interpret the comment above:
   > 
   > "We should try to avoid it, but its not like UB where it would indicate a 
bug. Panics are just exceptions." as if the fuzzer uncovers a panic, then 
submitting a patch to avoid the panic would be accepted?
   
   I think there are two levels of safety here:
   
   1. **no crash:** It can panic, but it should segfault or something. We have 
some unsafe code in our code base and even with safe code you can still trigger 
out-of-memory errors (e.g. when the input file header says it's 10000TB of data 
but the file is only 1kb large, you shouldn't try to allocate a vector/buffer 
with 10000TB). I think everyone agrees that this is the absolute minimum we 
have to provide.
   2. **no panic:** Here different developers may have different opinions. I 
personally think not panicking for a bunch of seemingly random kernel 
computations (esp. in the DataFusion context) is hard to avoid, but merely 
deserializing a parquet file to arrow should never panic and only use proper 
errors.
   
   That said, I would be OK if we start with a fuzz harness that [catches 
panics](https://doc.rust-lang.org/std/panic/fn.catch_unwind.html), stabilize 
that (e.g. it doesn't crash if you run it on your machine for a while), and 
then in a follow-up remove the panic-catch and see how far we get.
   
   
   [^libafl]: I was somewhat hoping that `libAFL` would evolve a bit faster but 
you still have to manually build it and last time I've checked -- which was a 
few months ago -- it was rather complicated to integrate into cargo-fuzz, even 
with the `libFuzzer` shim.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Fuzz tests for Arrow/Parquet [arrow-rs]

Reply via email to