colinmarc opened a new issue, #17970:
URL: https://github.com/apache/datafusion/issues/17970

   ### Describe the bug
   
   Hi!
   
   We're trying to support an unnamed BI tool on top of datafusion, which is 
fond of generating symmetric aggregates that look like this:
   
   ```sql
   SELECT (
       COALESCE(
           (
                   SUM(DISTINCT
                           (CAST(FLOOR(COALESCE( o_totalprice , 0) * 1000000) 
AS DECIMAL(65,0)))
                       +   _hex_to_bigint('x' || MD5( l_orderkey::varchar 
))::DECIMAL(65,0)
                       *   18446744073709551616
                       +   _hex_to_bigint('x' || SUBSTR(MD5( 
l_orderkey::varchar ) , 17))::DECIMAL(65,0)
                   )
               -   SUM(DISTINCT
                           _hex_to_bigint('x' || MD5( l_orderkey::varchar 
))::DECIMAL(65,0)
                       *   18446744073709551616
                       +   _hex_to_bigint('x' || SUBSTR(MD5( 
l_orderkey::varchar ), 17))::DECIMAL(65,0)
                   )
           ) / 1000000, 0) -- coalesce
   ) / COUNT(DISTINCT l_orderkey) as average_total
   FROM orders left join lineitem on (l_orderkey = o_orderkey)
   ```
   
   Here is some documentation from a totally different, 100% unrelated BI tool 
that explains why the above SQL is so cursed: 
https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates
   
   Roughly what it's doing is computing a really big sum twice, and subtracting 
it from itself. The numbers it's summing are hashes of a primary key. To 
generate big unique numbers, it's doing two `md5` hashes and bitshifting the 
first to the left.
   
   Unfortunately, the results from DF seem to be nondeterministic, and often 
negative. The issue seems to be with floating point math, related to the 
literal `18446744073709551616` (2^64). The BI tool seems to be expecting that 
the literal will be typed as a `DECIMAL(65, 0)`, since the left hand side of 
the multiplication is also `DECIMAL(65, 0)`, but DF is picking that up as a 
float:
   
   ```
   <snip> AS Decimal256(65, 0)) * Float64(18446744073709552000) <snip>
   ```
   
   This is especially weird because the literal in the plan is a completely 
different number. 😕
   
   To aid in debugging, I've included a small reproducer here: 
https://github.com/colinmarc/df-symmetric-aggregate
   
   Note that if you change the literal in the generated SQL to  
`18446744073709551615` (`2^64 - 1`), it gets picked up as a `UInt64`, which is 
still odd, but works, produces the correct value, and is deterministic.
   
   Thanks in advance!
   
   
   
   ### To Reproduce
   
   ```
   git clone https://github.com/colinmarc/df-symmetric-aggregate
   cd df-symmetric-aggregate
   cargo test
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to