[PR] [SPARK-57199][SQL] Extract the aggregate out-of-memory error into a QueryExecutionErrors factory [spark]

via GitHub Mon, 01 Jun 2026 15:25:46 -0700


gengliangwang opened a new pull request, #56256:
URL: https://github.com/apache/spark/pull/56256


   ### What changes were proposed in this pull request?
   
   The aggregate out-of-memory error (`AGGREGATE_OUT_OF_MEMORY`) is constructed 
inline in two places:
   
   - `HashAggregateExec`, whose whole-stage codegen emits `throw new 
<SparkOutOfMemoryError>("AGGREGATE_OUT_OF_MEMORY", new java.util.HashMap());` 
into every generated aggregate class.
   - `TungstenAggregationIterator` (the interpreted fallback), which throws the 
same `new SparkOutOfMemoryError(...)` and needs a `// scalastyle:off 
throwerror` suppression.
   
   This PR adds a `QueryExecutionErrors.aggregateOutOfMemoryError()` factory 
(next to the existing `cannotAcquireMemory*` OOM factories) and routes both 
call sites through it. In the codegen path the emitted Java becomes `throw 
QueryExecutionErrors.aggregateOutOfMemoryError();`.
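
The shape of the refactor can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `SketchOutOfMemoryError` and `SketchQueryExecutionErrors` are hypothetical stand-ins for Spark's `SparkOutOfMemoryError` and `QueryExecutionErrors`, and the constructor signature is simplified. Only the error-class string and the empty parameter map are taken from the description above.

```java
import java.util.Collections;
import java.util.Map;

class AggregateOomSketch {

    // Hypothetical stand-in for SparkOutOfMemoryError: an OutOfMemoryError
    // that carries an error class and its message parameters.
    static class SketchOutOfMemoryError extends OutOfMemoryError {
        final String errorClass;
        final Map<String, String> messageParameters;

        SketchOutOfMemoryError(String errorClass, Map<String, String> messageParameters) {
            super(errorClass);
            this.errorClass = errorClass;
            this.messageParameters = messageParameters;
        }
    }

    // Hypothetical stand-in for the QueryExecutionErrors factory: the
    // error-class string and the empty map live here, in one compiled method.
    static final class SketchQueryExecutionErrors {
        static SketchOutOfMemoryError aggregateOutOfMemoryError() {
            return new SketchOutOfMemoryError(
                "AGGREGATE_OUT_OF_MEMORY", Collections.<String, String>emptyMap());
        }
    }

    // What a call site reduces to after the change: a single factory call
    // instead of an inline `new ...("AGGREGATE_OUT_OF_MEMORY", new HashMap())`.
    static void failAggregation() {
        throw SketchQueryExecutionErrors.aggregateOutOfMemoryError();
    }
}
```

Because the factory returns the error rather than throwing it, the call site keeps an explicit `throw`, so the Java compiler still sees that the statement never completes normally.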
   
   ### Why are the changes needed?
   
   Sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen). 
Dumping the whole-stage codegen of the TPC-DS queries shows the inline `throw 
new org.apache.spark.memory.SparkOutOfMemoryError("AGGREGATE_OUT_OF_MEMORY", 
new java.util.HashMap());` line **445 times** across 142 of 150 generated 
classes -- it is the single most-repeated `throw` in the corpus. Replacing it 
with a factory call shrinks each generated aggregate class and moves the 
error-class string and the empty message-parameter map out of every generated 
class's constant pool into one compiled method. It also consolidates the error 
construction shared with the interpreted path and removes the `throwerror` 
scalastyle suppression there.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The same `AGGREGATE_OUT_OF_MEMORY` error with the same (empty) message 
parameters is thrown; only where it is constructed changes.
   
   ### How was this patch tested?
   
   This is a behavior-preserving refactor covered by the existing aggregate 
suites (e.g. all 163 tests in `DataFrameAggregateSuite` pass). The change was 
additionally verified by re-dumping the TPC-DS whole-stage codegen: all 445 
inline throws are now `QueryExecutionErrors.aggregateOutOfMemoryError()` calls, 
and every generated subtree still compiles (the Janino default imports already 
make `QueryExecutionErrors` available unqualified, as used by other generated 
error calls such as `divideByZeroError`). This mirrors the sibling 
`DateTimeExpressionUtils` codegen extractions, which likewise relied on 
existing expression-suite coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

