Consider this example:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+

I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate.
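So far, the main handle I’ve found is dumping the generated Java for the whole-stage subtrees, which (as far as I can tell) is exposed directly in PySpark:

>>> df = spark.range(4).repartition(2).select(sum("id"))
>>> df.explain(mode="codegen")  # prints the generated source for each WholeStageCodegen subtree

I can also set spark.sql.codegen.wholeStage to false to compare against the interpreted path, and I believe setting spark.sql.codegen.comments to true annotates the dump so it’s easier to map back to operators. But none of that shows me the aggregation buffer mutating at runtime, which is what my question below is about.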
I see that doProduceWithoutKeys <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98> and doConsumeWithoutKeys <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193> are called, and I believe they are responsible for computing a declarative aggregate like `sum`. But I’m not sure how to debug the generated code beyond reading the dump, or how to inspect the inputs that drive what code gets generated.

Say you were running the above example and it was producing an incorrect result, and you knew the problem was somehow related to the sum. How would you troubleshoot it to identify the root cause?

Ideally, I would like some way to track how the aggregation buffer mutates as the computation executes, so I can see something roughly like:

[0, 1, 2, 3]
[1, 5]
[6]

i.e. the input rows, the partial sums per partition, and the final merged result. Is there some way to trace a declarative aggregate like this? (The closest thing I’ve managed is in the P.S. below.)

Nick
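P.S. The closest approximation I’ve found is dropping to the RDD API, where aggregate() exposes the same update/merge split that the partial and final aggregation steps perform. This is only a rough analogue of the declarative aggregate’s update and merge expressions, and the prints from the update side come from the Python workers, so I only see them reliably in local mode:

>>> nums = spark.range(4).repartition(2).rdd.map(lambda row: row.id)
>>> nums.glom().collect()  # shows which ids landed in which partition
>>> def update(buf, x):
...     print("update:", buf, "->", buf + x)
...     return buf + x
...
>>> def merge(b1, b2):
...     print("merge:", b1, "+", b2)
...     return b1 + b2
...
>>> nums.aggregate(0, update, merge)
6

But this bypasses the generated code entirely, so it wouldn’t help with a bug that only shows up in the codegen path.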