cloud-fan opened a new pull request, #56036:
URL: https://github.com/apache/spark/pull/56036

   ## What changes were proposed in this pull request?
   
   Two semantic-preserving fixes to the SPARK-56627 widened-Cast peel in
   `DecimalAggregates`:
   
   1. **SUM arm**: Replace `Cast(MakeDecimal(_, p + 10, s), bounded(pPrime + 
10, s))`
      with `MakeDecimal(_, min(pPrime + 10, 38), s)`. The merged form's inner
      `MakeDecimal` narrowed the overflow check to `10^(p+10)`, where the
      un-rewritten `SUM(Cast(x, dec(pPrime, s)))` accepted up to
      `10^min(pPrime+10, 38)`. For `pPrime + 10 > 18`, `Decimal.setOrNull` falls
      into the BigDecimal branch and never rejects a Long, so the cleaner form
      preserves the original overflow boundary for all Long-fit sums.
   
   2. **AVG arm**: Change the guard from `p <= 7`
      (`AVG_PEEL_MAX_INNER_PRECISION`) to `pPrime + 4 <= 15`
      (`MAX_DOUBLE_DIGITS`). The merged guard switched `AVG(CAST(x AS 
DECIMAL(pPrime, s)))`
      from full Decimal arithmetic to Double-regime whenever `p <= 7`, including
      the `pPrime > 11` band where un-rewritten was Decimal-exact. The new guard
      ensures the rewrite only fires inside the existing AVG fast-path's Double
      envelope.
   
   ## Why are the changes needed?
   
   A query rewrite should preserve the query's observable semantics. The
   SPARK-56627 rule changed semantics in two ways:
   
   - **SUM**: Inner `MakeDecimal(_, p + 10, s)` could return null (non-ANSI) or
     throw (ANSI) for long-fit sums in \`(10^(p+10), 10^min(pPrime+10, 38))\`.
     Example: \`SUM(CAST(x AS DECIMAL(15, 2)))\` where \`x: DECIMAL(5, 2)\`
     rejected at \`10^15\` instead of \`10^25\` -- reachable around \`~10^10\`
     rows of small-precision input.
   - **AVG**: For \`pPrime > 11, p <= 7\`, the rule switched an exact-Decimal
     computation into Double-regime, visible as last-digit rounding differences
     at any input size. TPC-DS q18 (\`p=7, pPrime=12\`) was affected.
   
   ## Does this PR introduce _any_ user-facing change?
   
   Yes -- restores the un-rewritten semantics for affected queries:
   
   - \`SUM(CAST(x AS DECIMAL(pPrime, s)))\` no longer rejects long-fit sums
     that the un-rewritten expression would have accepted.
   - \`AVG(CAST(x AS DECIMAL(pPrime, s)))\` with \`pPrime + 4 > 
MAX_DOUBLE_DIGITS\`
     is no longer peeled -- the un-rewritten Decimal-exact path is preserved.
     TPC-DS q18 AVG aggregations revert to the un-peeled form (visible in the
     regenerated \`q18\` plan-stability files).
   
   ## How was this patch tested?
   
   - \`DecimalAggregatesSuite\`: 37/37 pass, including updated SUM arm shape
     assertions, the property-test invariants for the new MakeDecimal-only
     form, and the new AVG bound boundary tests.
   - \`DataFrameAggregateSuite\`: SPARK-56627 numerical-equivalence PBTs pass
     under the new AVG generator domain (\`pPrime in [p+1, 11]\`).
   - TPC-DS q18 plan-stability files regenerated locally.
   - \`DecimalAggregatesBenchmark\` AVG cases (B2, B4) updated to \`pPrime = 
11\`
     (the new bound). Result files left stale relative to the code change --
     the \`benchmark.yml\` GHA workflow can regenerate them.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to