wilmerdooley opened a new pull request, #56621: URL: https://github.com/apache/spark/pull/56621
## What changes and why

`SizeInBytesOnlyStatsPlanVisitor` estimates a plan's `sizeInBytes` by multiplying the sizes of child plans (via `product`) and by multiplying the unary visit result by the number of projections in an `Expand`. When a plan repeatedly joins its own output, those multiplicative estimates can grow without bound and the underlying `BigInt` eventually overflows in `BigInteger.multiply`, throwing from `BigInteger.reportOverflow`.

This change introduces a private `MAX_SIZE_IN_BYTES = BigInt(Long.MaxValue)` constant and applies `min(MAX_SIZE_IN_BYTES)` to every `sizeInBytes` produced in `SizeInBytesOnlyStatsPlanVisitor`: the `default` (multi-child) case, the `visitUnaryNode` helper, and `visitExpand`. The cap is far larger than any real dataset, so legitimate estimates are unaffected, but pathological self-join chains no longer overflow.

## Concrete changes

- Add a private `MAX_SIZE_IN_BYTES` constant in `SizeInBytesOnlyStatsPlanVisitor` and document the reason for the cap.
- Clamp the `sizeInBytes` produced by the `default` visitor (used for multi-child operators such as joins) with `.min(MAX_SIZE_IN_BYTES)`.
- Clamp the `sizeInBytes` returned by the unary-node helper with `.min(MAX_SIZE_IN_BYTES)`.
- Clamp the `sizeInBytes` returned by `visitExpand` with `.min(MAX_SIZE_IN_BYTES)`.
- Add a regression test in `BasicStatsEstimationSuite` that chains 30 self-joins of a leaf node whose size is already `Long.MaxValue` and asserts that the joined and projected sizes stay at or below the cap.

JIRA: https://issues.apache.org/jira/browse/SPARK-52163
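## Illustrative sketch

The clamping described above can be sketched outside of Spark roughly as follows. This is a minimal standalone illustration, not the code in this PR: only `MAX_SIZE_IN_BYTES` and its value come from the description, while `SizeCapSketch`, `cappedProduct`, `cappedExpand`, and `selfJoinChain` are hypothetical names chosen for the example.

```scala
// Standalone sketch; it does not depend on Spark.
// All names except MAX_SIZE_IN_BYTES are hypothetical.
object SizeCapSketch {
  // Same value as the constant described in the PR.
  val MAX_SIZE_IN_BYTES: BigInt = BigInt(Long.MaxValue)

  // Multi-child estimate: product of the child sizes, clamped to the cap.
  def cappedProduct(childSizes: Seq[BigInt]): BigInt =
    childSizes.product.min(MAX_SIZE_IN_BYTES)

  // Expand-style estimate: child size times the number of projections, clamped.
  def cappedExpand(childSize: BigInt, numProjections: Int): BigInt =
    (childSize * BigInt(numProjections)).min(MAX_SIZE_IN_BYTES)

  // Self-join chain: each step joins the current estimate with itself.
  // Because every step is clamped, the intermediate value never exceeds the cap.
  def selfJoinChain(leafSize: BigInt, joins: Int): BigInt =
    (1 to joins).foldLeft(leafSize) { (size, _) =>
      cappedProduct(Seq(size, size))
    }
}
```

With a leaf of `Long.MaxValue` bytes, `SizeCapSketch.selfJoinChain(BigInt(Long.MaxValue), 30)` stays at `MAX_SIZE_IN_BYTES`, whereas small inputs such as `cappedProduct(Seq(BigInt(10), BigInt(20)))` are returned unchanged. Clamping after each multiplication, rather than once at the end, is what keeps the intermediate `BigInt` values bounded.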
