Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Jack Goodson
I may be ignorant of other debugging methods in Spark, but the best success I've had is with using smaller datasets (if runs take a long time) and adding intermediate output steps. This is quite different from application development in non-distributed systems, where a debugger is trivial to attach, but I

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
OK, I figured it out. The details are in SPARK-47024 for anyone who’s interested. It turned out to be a floating point arithmetic “bug”. The main reason I was able to figure it out was that I’ve been investigating another, unrelated bug (a
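
For readers unfamiliar with this class of floating point "bug", here is a generic illustration. It is not the specific case from SPARK-47024, just a minimal sketch of why the same values can sum to different results depending on how they are grouped, for example across partitions. The helper names `naive_sum` and `partitioned_sum` are made up for this example.

```python
import math


def naive_sum(values):
    """Add floats left to right with plain '+' (no compensated summation)."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def partitioned_sum(partitions):
    """Sum each partition separately, then add the partial sums together."""
    return naive_sum([naive_sum(p) for p in partitions])


# Same three values, grouped two different ways.
single = partitioned_sum([[0.1, 0.2, 0.3]])   # (0.1 + 0.2) + 0.3
split = partitioned_sum([[0.1], [0.2, 0.3]])  # 0.1 + (0.2 + 0.3)

print(single == split)              # False: float addition is not associative
print(math.isclose(single, split))  # True: the results differ only by rounding
```

Comparing such results with a tolerance (as `math.isclose` does) rather than with `==` is usually the safer choice when the grouping is not under your control.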

Re: Extracting Input and Output Partitions in Spark

2024-02-12 Thread Aditya Sohoni
Sharing an example since a few people asked me off-list: We have stored the partition details in the read/write nodes of the physical plan. So this can be accessed via the plan like plan.getInputPartitions or plan.getOutputPartitions, which internally loops through the nodes in the plan and

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Herman van Hovell
There is no really easy way of getting the state of the aggregation buffer, unless you are willing to modify the code generation and sprinkle in some logging. What I would start with is dumping the generated code by calling explain('codegen') on the DataFrame. That helped me to find similar
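
A minimal PySpark sketch of dumping the generated code, assuming Spark 3.0 or later, where `DataFrame.explain` accepts `mode="codegen"`. The helper `dump_codegen` and the sample aggregate in `demo` are hypothetical and only for illustration; they are not from the thread.

```python
import contextlib
import io


def dump_codegen(df) -> str:
    """Return the text that df.explain(mode="codegen") prints."""
    buf = io.StringIO()
    # PySpark's DataFrame.explain prints to stdout, so capture it
    # in order to save or search the generated code afterwards.
    with contextlib.redirect_stdout(buf):
        df.explain(mode="codegen")
    return buf.getvalue()


def demo() -> None:
    # Requires a local PySpark installation. Imported here so that
    # dump_codegen itself can be loaded without Spark present.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Hypothetical aggregate, chosen only to have something to inspect.
    df = spark.range(100).groupBy((F.col("id") % 3).alias("k")).agg(F.sum("id"))
    print(dump_codegen(df))
    spark.stop()
```

Calling `demo()` prints the generated code for each whole-stage codegen subtree; capturing it as a string makes it easier to write to a file or compare between runs.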