2010YOUY01 commented on issue #14535: URL: https://github.com/apache/datafusion/issues/14535#issuecomment-2712646798
> Hello, I am interested in applying to work on this project for GSoC. After reading through [#11030](https://github.com/apache/datafusion/issues/11030), it looks like the three testing oracles that have been implemented from SQLancer are NoREC, TLP, and PQS. Were those chosen because they were the easiest to implement, or was there something about how they test Datafusion specifically?

👋🏼 They were implemented first because:

- All of them are general-purpose algorithms, which should work in most systems:
  - TLP: it targets edge-case value handling (NULLs), so it should work well for DataFusion.
  - NoREC: it checks the consistency between the optimized path (via predicate pushdown) and the non-optimized path, and also between how the same predicate is evaluated in `select expr` and in `where expr`. It has also caught several bugs in DataFusion.
  - PQS: I don't have a good intuition for why this one should work, and I think it has caught 0 or 1 bugs 🤦🏼 Perhaps I'm missing something.
- Yes, they're also very easy to implement.

To make fuzzing more specific to DataFusion, I think what is most needed is configuration fuzzing or data source fuzzing.

To make `DataFusion` more performant, the same executor has many specialized execution paths, which are controlled by turning the configuration knobs listed at https://datafusion.apache.org/user-guide/configs.html

However, these fuzzing approaches are quite hard to implement due to the complexity. Given a randomly generated query, if we pick a random value for every configuration option, the run is very likely to fail because the configuration is invalid. I believe that for a specific query, only a small subset of the configurations is relevant.
So for now, this kind of configuration fuzzing is implemented separately, for example in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs

I wish configuration fuzzing could be integrated into `datafusion-sqlancer`, but it still has a long way to go.
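
For readers new to the oracles discussed above, here is a minimal sketch of the NoREC and TLP checks. It makes several assumptions: it uses Python's built-in `sqlite3` as a stand-in engine purely so the snippet is self-contained, it is not how `datafusion-sqlancer` is implemented, and the helper names `norec_counts` and `tlp_holds` as well as the table `t` are made up for illustration. Real fuzzers generate the tables and predicates randomly; here they are fixed.

```python
import sqlite3
from collections import Counter


def norec_counts(conn, table, predicate):
    """NoREC-style check (illustrative helper, not a real API).

    Counts matching rows two ways: with the predicate in WHERE (where the
    engine may optimize it) and with the predicate in the SELECT list,
    counting the true values on the client side.
    """
    in_where = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {predicate}"
    ).fetchone()[0]
    values = conn.execute(f"SELECT ({predicate}) FROM {table}").fetchall()
    # SQLite returns 1 / 0 / NULL for comparisons; only nonzero counts as true.
    in_select = sum(1 for (v,) in values if v is not None and v != 0)
    return in_where, in_select


def tlp_holds(conn, table, predicate):
    """TLP-style check (illustrative helper, not a real API).

    Under three-valued logic every row satisfies exactly one of
    p, NOT p, or p IS NULL, so the three partitions together must
    equal the unfiltered table (compared as multisets of rows).
    """
    base = Counter(conn.execute(f"SELECT * FROM {table}").fetchall())
    parts = Counter()
    for p in (f"({predicate})", f"NOT ({predicate})", f"({predicate}) IS NULL"):
        parts.update(conn.execute(f"SELECT * FROM {table} WHERE {p}").fetchall())
    return base == parts


# Tiny fixed data set with NULLs, the edge case TLP is meant to stress.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (2, None), (None, 30), (4, 40)],
)

for pred in ("a > 1", "a > 1 OR b < 35", "b IS NULL"):
    where_count, select_count = norec_counts(conn, "t", pred)
    # A mismatch in either check would point at a logic bug in the engine.
    print(pred, where_count == select_count, tlp_holds(conn, "t", pred))
```

Both checks need no ground-truth result: the engine is only compared against itself, which is why these oracles carry over to most SQL systems.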