theirix commented on PR #19874: URL: https://github.com/apache/datafusion/pull/19874#issuecomment-3887735901
Hello @Jefffrey, I agree, there aren't many functions that support both decimal types. Most likely, it will be only `gcd` and `lcm`, aside from user-defined functions. For other functions, the changes are mostly neutral. The rationale for this PR is to support these symmetric functions for both decimal types and avoid scaling issues, as shown below. This implementation of `gcd` (`lcm` can be done similarly) with this PR is shown here - https://github.com/apache/datafusion/compare/f9ffc8242c3c253efdc3cd73e2a9bebbe6de0393...theirix:datafusion:gcd-decimal-new-api . It uses a new API and utilises a new arrow num_traits support, so decimals are treated as any other numbers. When I tried to [implement it using the old function](https://github.com/theirix/datafusion/commit/0768bcdcaa2c1ab915de3da3b8724c35edcf994a), I immediately encountered a problem with the old function `calculate_binary_decimal_math` because it doesn't take scale into account, so the result is incorrect. Shortly, it failed with `select gcd(2::decimal(38, 0), 3::decimal(38, 0));`, with a cast of a proper input decimal (`3::Decimal`) to a raw arrow type `Decimal128Type`, abruptly adding default scale `DECIMAL_DEFAULT_SCALE=10`, so it becomes `30000000000`. It didn't happen if the second argument is a non-decimal number (pow, log, etc). Logs: <details> ``` [2026-02-09T22:17:42Z INFO datafusion_functions::math::gcd] invoke gcd with PrimitiveArray<Decimal128(38, 0)> [ 2, ] and Scalar(Decimal128(Some(3),38,0)) [2026-02-09T22:17:42Z INFO datafusion_functions::utils] Calculating binary math with left PrimitiveArray<Decimal128(38, 0)> [ 2, ] and right Scalar(Decimal128(Some(30000000000),38,10)) [2026-02-09T22:17:42Z INFO datafusion_functions::math::gcd] euclid_gcd_unsigned a=2 b=30000000000 -> Ok(2) [2026-02-09T22:17:42Z INFO datafusion_functions::utils] Calculating binary math with result PrimitiveArray<Decimal128(38, 10)> [ 2, ] ``` </details> This implementation avoids this bug by properly analysing scale, so rescaling shouldn't be handled in the UDF code. If you check the implementation closely, it uses rescaling only for the decimal-decimal case. Other cases have code almost identical to the old function. I can see two approaches for implementing decimal-decimal math functions: 1. (as in this PR). Fix the utility function to be scale-aware to prevent correctness issues. It has more code, but handles all cases properly with casting, scaling, and error checking. UDFs remain simple. 2. Handle it per-UDF. Move this logic to decimal-decimal UDFs (`gcd`, `lcm`). It keeps the utils function lean. At the same time, it adds more complex logic on the UDF side, for both of them. It may also lead to similar bugs and edge cases. I favour the first approach for consistency, but I'm happy to refactor with the second approach if you find it more suitable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
