andygrove commented on PR #4749: URL: https://github.com/apache/datafusion-comet/pull/4749#issuecomment-4849049961
Thanks for the updates. The divisor guard plus the two new tests address all three of my earlier comments. On the precise-decimal vs double-average question I raised, I dug into it further and I am now satisfied there is no real divergence to worry about. I built the worst case on purpose: 32 rows in one partition (a power-of-two count, so `sum / count` is exact in `double`), thirty-one `0.0000` and one `0.0003`, where the average `3 / 32 / 10^4 = 0.000009375` lands exactly on the HALF_UP boundary at the result scale. Both Spark and Comet return `0.00000938`. The reason is Spark's cast from the `double` result back to decimal. It goes through `Decimal.apply(BigDecimal.valueOf(d))` in `Cast.scala`, which uses `Double.toString`, the shortest round-trip string. That recovers `0.000009375` rather than the exact binary value `0.00000937499...`, so it rounds up and matches Comet's exact `avg_decimal`. This is exactly why Spark gates the rewrite on `prec + 4 <= 15` (`MAX_DOUBLE_DIGITS`): within 15 significant digits the `double` round-trips to a unique short decimal equal to the exactly-rounded value. To be thorough I also ran a large targeted search over rounding-boundary inputs across a range of scales and counts and found zero mismatches. So the approach of restoring the decimal child is sound within the range the rewrite applies to, and the fuzz test is good defensive coverage on top of that. No further concerns from me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
