Apologies if it wasn't clear; I meant the difficulty of debugging, not floating-point precision :)
On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Jack,
>
> "... most SQL engines suffer from the same issue ..."
>
> Sure. This behavior is not a bug, but rather a consequence of the
> limitations of floating-point precision. The numbers involved in the
> example (see [SPARK-47024] Sum of floats/doubles may be incorrect
> depending on partitioning - ASF JIRA
> <https://issues.apache.org/jira/browse/SPARK-47024>) exceed the precision
> of the double-precision floating-point representation used by default in
> Spark and other engines. It is interesting to have a look and test the
> code.
>
> This is the code:
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql.functions import col, sum
>
>     SUM_EXAMPLE = [
>         (1.0,), (0.0,), (1.0,), (9007199254740992.0,),
>     ]
>
>     spark = (
>         SparkSession.builder
>         .config("spark.log.level", "ERROR")
>         .getOrCreate()
>     )
>
>     def compare_sums(data, num_partitions):
>         df = spark.createDataFrame(data, "val double").coalesce(1)
>         result1 = df.agg(sum(col("val"))).collect()[0][0]
>         df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>         result2 = df.agg(sum(col("val"))).collect()[0][0]
>         assert result1 == result2, f"{result1}, {result2}"
>
>     if __name__ == "__main__":
>         print(compare_sums(SUM_EXAMPLE, 2))
>
> In Python, floating-point numbers are implemented using the IEEE 754
> standard
> <https://stackoverflow.com/questions/73340696/how-is-pythons-decimal-and-other-precise-decimal-libraries-implemented-and-wh>,
> which has limited precision. When one performs operations with very large
> numbers or numbers with many decimal places, one may encounter precision
> errors:
>
>     print(compare_sums(SUM_EXAMPLE, 2))
>       File "issue01.py", line 23, in compare_sums
>         assert result1 == result2, f"{result1}, {result2}"
>     AssertionError: 9007199254740994.0, 9007199254740992.0
>
> In this case, the result of the aggregation (sum) is affected by the
> precision limits of floating-point representation. The difference between
> 9007199254740994.0 and 9007199254740992.0 is within the expected precision
> limitations of double-precision floating-point numbers.
>
> The likely cause in this example:
>
> When one performs an aggregate operation like sum on a DataFrame, the
> operation may be affected by the order of the data, and here the order of
> the data can be influenced by the number of partitions in Spark. result2
> above creates a new DataFrame df with the same data but explicitly
> repartitions it into two partitions (repartition(num_partitions)).
> Repartitioning can shuffle the data across partitions, introducing a
> different order for the subsequent aggregation. The sum operation is then
> performed on the data in a different order, leading to a slightly
> different result from result1.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
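
The same order sensitivity can be reproduced in plain Python, independent of
Spark. A minimal sketch, assuming IEEE 754 doubles (9007199254740992.0 is
2**53, above which not every integer has an exact double representation):

    >>> 1.0 + 0.0 + 1.0 + 9007199254740992.0   # small values first: 2.0 + 2**53 is exactly representable
    9007199254740994.0
    >>> 9007199254740992.0 + 1.0 + 0.0 + 1.0   # large value first: each +1.0 rounds back down to 2**53
    9007199254740992.0

Each intermediate sum is rounded to the nearest representable double, so the
grouping that partitioning imposes on the data changes the final result.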
>
> On Tue, 13 Feb 2024 at 03:06, Jack Goodson <jackagood...@gmail.com> wrote:
>
>> I may be ignorant of other debugging methods in Spark, but the best
>> success I've had is using smaller datasets (if runs take a long time) and
>> adding intermediate output steps. This is quite different from application
>> development in non-distributed systems, where a debugger is trivial to
>> attach, but I believe it's one of the trade-offs of using a system like
>> Spark for data processing; most SQL engines suffer from the same issue. If
>> you do believe there is a bug in Spark, using the explain function like
>> Herman mentioned helps, as does looking at the Spark plan in the Spark UI.
>>
>> On Tue, Feb 13, 2024 at 9:24 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> OK, I figured it out. The details are in SPARK-47024
>>> <https://issues.apache.org/jira/browse/SPARK-47024> for anyone who’s
>>> interested.
>>>
>>> It turned out to be a floating-point arithmetic “bug”. The main reason I
>>> was able to figure it out was that I’ve been investigating another,
>>> unrelated bug (a real bug) related to floats, so these weird float corner
>>> cases have been top of mind.
>>>
>>> If it weren’t for that, I wonder how much progress I would have made.
>>> Though I could inspect the generated code, I couldn’t figure out how to
>>> get logging statements placed in the generated code to print somewhere I
>>> could see them.
>>>
>>> Depending on how often we find ourselves debugging aggregates like this,
>>> it would be really helpful if we added some way to trace the aggregation
>>> buffer.
>>>
>>> In any case, mystery solved. Thank you for the pointer!
>>>
>>>
>>> On Feb 12, 2024, at 8:39 AM, Herman van Hovell <her...@databricks.com> wrote:
>>>
>>> There is no really easy way of getting the state of the aggregation
>>> buffer, unless you are willing to modify the code generation and sprinkle
>>> in some logging.
>>>
>>> What I would start with is dumping the generated code by calling
>>> explain('codegen') on the DataFrame. That helped me find similar issues
>>> in most cases.
>>>
>>> HTH
>>>
>>> On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Consider this example:
>>>>
>>>>     >>> from pyspark.sql.functions import sum
>>>>     >>> spark.range(4).repartition(2).select(sum("id")).show()
>>>>     +-------+
>>>>     |sum(id)|
>>>>     +-------+
>>>>     |      6|
>>>>     +-------+
>>>>
>>>> I’m trying to understand how this works because I’m investigating a bug
>>>> in this kind of aggregate.
>>>>
>>>> I see that doProduceWithoutKeys
>>>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98>
>>>> and doConsumeWithoutKeys
>>>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193>
>>>> are called, and I believe they are responsible for computing a
>>>> declarative aggregate like `sum`. But I’m not sure how I would debug the
>>>> generated code, or the inputs that drive what code gets generated.
>>>>
>>>> Say you were running the above example and it was producing an
>>>> incorrect result, and you knew the problem was somehow related to the
>>>> sum. How would you troubleshoot it to identify the root cause?
>>>>
>>>> Ideally, I would like some way to track how the aggregation buffer
>>>> mutates as the computation is executed, so I can see something roughly
>>>> like:
>>>>
>>>>     [0, 1, 2, 3]
>>>>     [1, 5]
>>>>     [6]
>>>>
>>>> Is there some way to trace a declarative aggregate like this?
>>>>
>>>> Nick
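
As a starting point for the tracing question, a minimal sketch of the codegen
dump Herman suggested (assuming PySpark 3.0+, where explain() accepts a mode
argument), applied to the example that opened the thread:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sum

    spark = SparkSession.builder.getOrCreate()

    # The same no-keys aggregate from the original example.
    df = spark.range(4).repartition(2).select(sum("id"))

    # Print the generated Java code for each whole-stage-codegen subtree.
    # The sum's aggregation buffer is kept as mutable state in the generated
    # class, so this is the code to read (or instrument) when tracing it.
    df.explain(mode="codegen")

This only dumps the code; watching the buffer mutate still means modifying
the code generation to sprinkle in logging, as Herman noted.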