OK, I figured it out. The details are in SPARK-47024 
<https://issues.apache.org/jira/browse/SPARK-47024> for anyone who’s interested.

It turned out to be a floating point arithmetic “bug”. The main reason I was 
able to figure it out was that I’ve been investigating another, unrelated bug 
(a real bug) related to floats, so these weird float corner cases have been 
top of mind.
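
As a general illustration of the kind of float corner case I mean: floating 
point addition isn’t associative, so summing the same values in a different 
order (which is exactly what a different partitioning can cause) may 
legitimately produce a different result. In plain Python:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6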

If it weren't for that prior investigation, I wonder how much progress I would 
have made. Though I could inspect the generated code, I couldn’t figure out 
how to get logging statements that I placed in the generated code to print 
anywhere I could see them.
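
For reference, the way I did manage to inspect the generated code was 
Herman’s suggestion below, which in PySpark looks like this:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).explain("codegen")

That prints the Java source of each whole-stage codegen subtree to stdout, 
though it offers no hook for adding logging to it.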

Depending on how often we find ourselves debugging aggregates like this, it 
would be really helpful if we added some way to trace the aggregation buffer.
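
To sketch what I mean, here’s a toy simulation of the two-stage sum in plain 
Python (using the built-in sum; the partition contents are illustrative, not 
necessarily how Spark split them):

>>> parts = [[0, 1], [2, 3]]     # say repartition(2) split range(4) this way
>>> [sum(p) for p in parts]      # per-partition (partial) aggregation buffers
[1, 5]
>>> sum(sum(p) for p in parts)   # merged final buffer
6

Being able to observe those intermediate buffer states during a real query is 
exactly the kind of tracing I have in mind.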

In any case, mystery solved. Thank you for the pointer!


> On Feb 12, 2024, at 8:39 AM, Herman van Hovell <her...@databricks.com> wrote:
> 
> There is no really easy way of getting the state of the aggregation buffer, 
> unless you are willing to modify the code generation and sprinkle in some 
> logging.
> 
> What I would start with is dumping the generated code by calling 
> explain('codegen') on the DataFrame. That helped me to find similar issues in 
> most cases.
> 
> HTH
> 
> On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> Consider this example:
>> >>> from pyspark.sql.functions import sum
>> >>> spark.range(4).repartition(2).select(sum("id")).show()
>> +-------+
>> |sum(id)|
>> +-------+
>> |      6|
>> +-------+
>> 
>> I’m trying to understand how this works because I’m investigating a bug in 
>> this kind of aggregate.
>> 
>> I see that doProduceWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98>
>>  and doConsumeWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193>
>>  are called, and I believe they are responsible for computing a declarative 
>> aggregate like `sum`. But I’m not sure how I would debug the generated code, 
>> or the inputs that drive what code gets generated.
>> 
>> Say you were running the above example and it was producing an incorrect 
>> result, and you knew the problem was somehow related to the sum. How would 
>> you troubleshoot it to identify the root cause?
>> 
>> Ideally, I would like some way to track how the aggregation buffer mutates 
>> as the computation is executed, so I can see something roughly like:
>> [0, 1, 2, 3]
>> [1, 5]
>> [6]
>> 
>> Is there some way to trace a declarative aggregate like this?
>> 
>> Nick
>> 
