andygrove edited a comment on issue #913:
URL: 
https://github.com/apache/arrow-datafusion/issues/913#issuecomment-906479167


   This paper is also worth a read for anyone interested in learning how 
Databricks uses query fuzzing with Spark.
   
   - [SparkFuzz: Searching Correctness Regressions in Modern Query 
Engines](https://bogdanghit.github.io/publications/sparkfuzz.pdf)
   
   I have been doing some query fuzzing myself in my day job to compare Spark 
with Spark on the GPU (using the [RAPIDS Accelerator for Apache 
Spark](https://github.com/NVIDIA/spark-rapids)). My approach there was to 
generate random logical query plans directly.
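   To illustrate the idea of generating logical plans directly, here is a minimal, self-contained sketch in Rust. The `PlanNode` enum, `random_plan` function, and the tiny xorshift PRNG are all invented for this example; they are not the DataFusion or spark-rapids APIs.

```rust
// Hypothetical sketch of random logical-plan generation for fuzzing.
// All names here (Rng, PlanNode, random_plan) are illustrative only.

// Tiny deterministic xorshift PRNG so the example has no external deps.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
    fn below(&mut self, n: u64) -> u64 {
        self.next() % n
    }
}

// A toy logical plan: scans, filters, and projections.
#[derive(Debug)]
enum PlanNode {
    Scan { table: String },
    Filter { predicate: String, input: Box<PlanNode> },
    Projection { columns: Vec<String>, input: Box<PlanNode> },
}

/// Recursively build a random plan, always bottoming out at a table scan.
fn random_plan(rng: &mut Rng, depth: u32) -> PlanNode {
    if depth == 0 || rng.below(3) == 0 {
        return PlanNode::Scan { table: format!("t{}", rng.below(4)) };
    }
    let input = Box::new(random_plan(rng, depth - 1));
    if rng.below(2) == 0 {
        PlanNode::Filter {
            predicate: format!("c{} > {}", rng.below(3), rng.below(100)),
            input,
        }
    } else {
        PlanNode::Projection {
            columns: vec![format!("c{}", rng.below(3))],
            input,
        }
    }
}

fn main() {
    let mut rng = Rng(42);
    for _ in 0..3 {
        println!("{:?}", random_plan(&mut rng, 3));
    }
}
```

   In a real fuzzer, each generated plan would be executed on both engines (e.g. CPU Spark and GPU Spark) and the result sets compared.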
   
   I have been contemplating doing something similar with DataFusion/Ballista: 
generate random plans in Rust, encode them to protobuf using the Ballista serde 
module, then write Scala code that reads these protobuf files and translates 
them into Spark plans. I have an old proof-of-concept of some of this in my 
[How Query Engines Work](https://github.com/andygrove/how-query-engines-work) 
repo.
   
   - [Generating random query 
plans](https://github.com/andygrove/how-query-engines-work/blob/main/jvm/fuzzer/src/main/kotlin/Fuzzer.kt)
   - [Translating protobuf query plan to 
Spark](https://github.com/andygrove/how-query-engines-work/blob/main/spark/executor/src/main/scala/org/ballistacompute/spark/executor/BallistaSparkContext.scala)
   
   With the new Arrow Compute IR proposal, an approach along these lines could 
also enable fuzzing tools that work across Arrow implementations.

