andygrove commented on issue #913:
URL: https://github.com/apache/arrow-datafusion/issues/913#issuecomment-906479167
This paper, [SparkFuzz: Searching Correctness Regressions in Modern Query Engines](https://bogdanghit.github.io/publications/sparkfuzz.pdf), is also worth a read for anyone interested in learning how Databricks uses query fuzzing with Spark.

I have been doing some query fuzzing myself in my day job to compare Spark with Spark on GPU (using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids)). My approach there was to generate logical query plans directly.

I had been contemplating doing something similar with DataFusion/Ballista: generate random plans in Rust, encode them to protobuf using the Ballista serde module, and then write Scala code that reads these protobuf files and translates them into Spark plans (rough sketches of the plan-generation and protobuf round-trip steps follow below). I have an old proof-of-concept of some of this already in my [How Query Engines Work](https://github.com/andygrove/how-query-engines-work) repo:

- [Generating random query plans](https://github.com/andygrove/how-query-engines-work/blob/main/jvm/fuzzer/src/main/kotlin/Fuzzer.kt)
- [Translating a protobuf query plan to Spark](https://github.com/andygrove/how-query-engines-work/blob/main/spark/executor/src/main/scala/org/ballistacompute/spark/executor/BallistaSparkContext.scala)

With the new Arrow Compute IR proposal, an approach along these lines would also be useful for building fuzzing tools that work across Arrow implementations.
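To make the plan-generation idea concrete, here is a minimal sketch of what a random plan generator could look like. The `Expr` and `Plan` enums, `random_expr`, and `random_plan` are hypothetical stand-ins defined just for illustration; a real fuzzer would construct DataFusion's actual logical plan types (or the Ballista protobuf messages) instead, and the `rand` crate is assumed as a dependency.

```rust
// Sketch only: `Expr` and `Plan` are simplified stand-ins for the real
// DataFusion/Ballista logical plan types.
use rand::Rng;

#[derive(Debug)]
enum Expr {
    Column(String),
    LiteralInt(i64),
    Gt(Box<Expr>, Box<Expr>),
    Plus(Box<Expr>, Box<Expr>),
}

#[derive(Debug)]
enum Plan {
    // Leaf node: scan a named table with a fixed set of columns.
    Scan { table: String, columns: Vec<String> },
    Filter { predicate: Expr, input: Box<Plan> },
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
}

/// Generate a random expression over the given columns, bounding
/// recursion with `depth` so expressions stay finite.
fn random_expr<R: Rng>(rng: &mut R, columns: &[String], depth: usize) -> Expr {
    if depth == 0 || rng.gen_bool(0.4) {
        // Leaf: a column reference or an integer literal.
        if rng.gen_bool(0.5) {
            Expr::Column(columns[rng.gen_range(0..columns.len())].clone())
        } else {
            Expr::LiteralInt(rng.gen_range(-100..100))
        }
    } else {
        let l = Box::new(random_expr(rng, columns, depth - 1));
        let r = Box::new(random_expr(rng, columns, depth - 1));
        if rng.gen_bool(0.5) { Expr::Gt(l, r) } else { Expr::Plus(l, r) }
    }
}

/// Generate a random plan: a scan wrapped in a random stack of
/// filters and projections.
fn random_plan<R: Rng>(rng: &mut R, depth: usize) -> Plan {
    let columns: Vec<String> = vec!["a".into(), "b".into(), "c".into()];
    let mut plan = Plan::Scan { table: "t0".into(), columns: columns.clone() };
    for _ in 0..depth {
        plan = if rng.gen_bool(0.5) {
            Plan::Filter {
                predicate: random_expr(rng, &columns, 2),
                input: Box::new(plan),
            }
        } else {
            Plan::Projection {
                exprs: (0..rng.gen_range(1..=3))
                    .map(|_| random_expr(rng, &columns, 2))
                    .collect(),
                input: Box::new(plan),
            }
        };
    }
    plan
}

fn main() {
    let mut rng = rand::thread_rng();
    for _ in 0..3 {
        // Each generated plan would then be serialized and executed on
        // both engines so the results can be compared.
        println!("{:?}", random_plan(&mut rng, 3));
    }
}
```

The SparkFuzz-style oracle then falls out naturally: run the same generated plan on both engines and diff the results.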

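And here is a sketch of the protobuf round-trip step, using the `prost` crate (which Ballista's serde is built on). `FuzzPlanNode` is a hypothetical flat message defined inline purely for illustration; the real approach would use the message types generated from Ballista's `ballista.proto`, which model plans recursively.

```rust
// Sketch only: `FuzzPlanNode` is a made-up message, not Ballista's schema.
use prost::Message;

#[derive(Clone, PartialEq, Message)]
pub struct FuzzPlanNode {
    /// Table scanned by the leaf of the plan.
    #[prost(string, tag = "1")]
    pub table: String,
    /// Columns kept by a projection above the scan.
    #[prost(string, repeated, tag = "2")]
    pub projection: Vec<String>,
    /// Optional filter predicate, rendered as a string for simplicity.
    #[prost(string, optional, tag = "3")]
    pub predicate: Option<String>,
}

fn main() -> Result<(), prost::DecodeError> {
    let plan = FuzzPlanNode {
        table: "t0".to_string(),
        projection: vec!["a".to_string(), "b".to_string()],
        predicate: Some("a > 10".to_string()),
    };

    // Encode to bytes: these are what the Scala side would read back
    // with the protobuf-generated Java classes and translate into an
    // equivalent Spark plan.
    let bytes = plan.encode_to_vec();
    let decoded = FuzzPlanNode::decode(bytes.as_slice())?;
    assert_eq!(plan, decoded);
    Ok(())
}
```

Because protobuf is the interchange format, the Scala side never needs to link against any Rust code; it only needs the generated classes for the same `.proto` definitions.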