There is no easy way to get at the state of the aggregation buffer, unless
you are willing to modify the code generation and sprinkle in some
logging.
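
If what you are after is just the intermediate per-partition values (the
[1, 5] step in your example below), one crude approximation that does not
require touching codegen is to redo the two-phase aggregation by hand at
the RDD level. This is only a sketch of the partial/merge structure, not
the actual aggregation buffer the generated code mutates:

    # Rebuild the partial-then-merge shape of the aggregation by hand,
    # purely to observe the intermediate per-partition values.
    # NOTE: sum() here is Python's builtin, which a prior
    # `from pyspark.sql.functions import sum` would shadow.
    rdd = spark.range(4).repartition(2).rdd.map(lambda row: row.id)

    partials = rdd.mapPartitions(lambda it: [sum(it)]).collect()
    print(partials)        # one partial sum per partition, e.g. [1, 5]
    print(sum(partials))   # the merged result: 6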

What I would start with is dumping the generated code by calling
explain('codegen') on the DataFrame. That has helped me track down
similar issues in most cases.
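
For the query you posted, assuming an active `spark` session, that looks
like:

    from pyspark.sql.functions import sum as sum_

    df = spark.range(4).repartition(2).select(sum_("id"))

    # Prints the Java source produced by whole-stage codegen for each
    # stage, including the no-grouping-key aggregate code that updates
    # the sum buffer.
    df.explain("codegen")

In a Scala shell the equivalent is df.debugCodegen() after
`import org.apache.spark.sql.execution.debug._`.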

HTH

On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Consider this example:
>
> >>> from pyspark.sql.functions import sum
> >>> spark.range(4).repartition(2).select(sum("id")).show()
> +-------+
> |sum(id)|
> +-------+
> |      6|
> +-------+
>
> I’m trying to understand how this works because I’m investigating a bug in
> this kind of aggregate.
>
> I see that doProduceWithoutKeys
> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98>
> and doConsumeWithoutKeys
> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193>
> are called, and I believe they are responsible for computing a declarative
> aggregate like `sum`. But I’m not sure how I would debug the generated
> code, or the inputs that drive what code gets generated.
>
> Say you were running the above example and it was producing an incorrect
> result, and you knew the problem was somehow related to the sum. How would
> you troubleshoot it to identify the root cause?
>
> Ideally, I would like some way to track how the aggregation buffer mutates
> as the computation is executed, so I can see something roughly like:
>
> [0, 1, 2, 3]
> [1, 5]
> [6]
>
> Is there some way to trace a declarative aggregate like this?
>
> Nick
>
>
