Consider this example:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+

I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate.
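So far, the main handle I’ve found is dumping the generated Java for the whole-stage subtrees, which (as far as I can tell) is exposed directly in PySpark:

>>> df = spark.range(4).repartition(2).select(sum("id"))
>>> df.explain(mode="codegen")  # prints the generated source for each WholeStageCodegen subtree

I can also set spark.sql.codegen.wholeStage to false to compare against the interpreted path, and I believe setting spark.sql.codegen.comments to true annotates the dump so it’s easier to map back to operators. But none of that shows me the aggregation buffer mutating at runtime, which is what my question below is about.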
I see that doProduceWithoutKeys <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98> and doConsumeWithoutKeys <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193> are called, and I believe they are responsible for computing a declarative aggregate like `sum`. But I’m not sure how to debug the generated code beyond reading the dump, or how to inspect the inputs that drive what code gets generated.

Say you were running the above example and it was producing an incorrect result, and you knew the problem was somehow related to the sum. How would you troubleshoot it to identify the root cause?

Ideally, I would like some way to track how the aggregation buffer mutates as the computation executes, so I can see something roughly like:

[0, 1, 2, 3]
[1, 5]
[6]

i.e. the input rows, the partial sums per partition, and the final merged result. Is there some way to trace a declarative aggregate like this? (The closest thing I’ve managed is in the P.S. below.)

Nick
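P.S. The closest approximation I’ve found is dropping to the RDD API, where aggregate() exposes the same update/merge split that the partial and final aggregation steps perform. This is only a rough analogue of the declarative aggregate’s update and merge expressions, and the prints from the update side come from the Python workers, so I only see them reliably in local mode:

>>> nums = spark.range(4).repartition(2).rdd.map(lambda row: row.id)
>>> nums.glom().collect()  # shows which ids landed in which partition
>>> def update(buf, x):
...     print("update:", buf, "->", buf + x)
...     return buf + x
...
>>> def merge(b1, b2):
...     print("merge:", b1, "+", b2)
...     return b1 + b2
...
>>> nums.aggregate(0, update, merge)
6

But this bypasses the generated code entirely, so it wouldn’t help with a bug that only shows up in the codegen path.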