wilmerdooley opened a new pull request, #56621:
URL: https://github.com/apache/spark/pull/56621

   ## What changes and why
   
   `SizeInBytesOnlyStatsPlanVisitor` estimates a plan's `sizeInBytes` by 
multiplying the sizes of child plans (via `product`) and, for an `Expand`, by 
multiplying the unary visit result by the number of projections. When a plan 
repeatedly joins its own output, these multiplicative estimates grow without 
bound until the underlying `BigInt` overflows in `BigInteger.multiply`, which 
throws an `ArithmeticException` from `BigInteger.reportOverflow`.
   
   This change introduces a private `MAX_SIZE_IN_BYTES = BigInt(Long.MaxValue)` 
constant and applies `min(MAX_SIZE_IN_BYTES)` to every `sizeInBytes` produced 
in `SizeInBytesOnlyStatsPlanVisitor`: the `default` (multi-child) case, the 
`visitUnaryNode` helper, and `visitExpand`. The cap (roughly 8 EiB) is far 
larger than any real dataset, so legitimate estimates are unaffected, but 
pathological self-join chains no longer overflow.
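
The clamping arithmetic can be sketched in isolation as follows. This is a minimal illustration, not the code in this PR: `SizeCapSketch`, `multiChildSize`, and `expandSize` are hypothetical names, and the real visitor works on `LogicalPlan` statistics rather than bare `BigInt` values.

```scala
// Hypothetical stand-in for the visitor's arithmetic; not Spark's actual code.
object SizeCapSketch {
  // Mirrors the cap described above: BigInt(Long.MaxValue).
  val MaxSizeInBytes: BigInt = BigInt(Long.MaxValue)

  // Multi-child estimate: product of the child sizes, clamped to the cap.
  def multiChildSize(childSizes: Seq[BigInt]): BigInt =
    childSizes.product.min(MaxSizeInBytes)

  // Expand-style estimate: child size times projection count, clamped to the cap.
  def expandSize(childSize: BigInt, numProjections: Int): BigInt =
    (childSize * numProjections).min(MaxSizeInBytes)
}
```

Small inputs pass through unchanged (for example, children of 10 and 20 bytes still give 200), while any product above `Long.MaxValue` collapses to the cap.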
   
   ## Concrete changes
   
   - Add a private `MAX_SIZE_IN_BYTES` constant in 
`SizeInBytesOnlyStatsPlanVisitor` and document the reason for the cap.
   - Clamp the `sizeInBytes` produced by the `default` visitor (used for 
multi-child operators such as joins) with `.min(MAX_SIZE_IN_BYTES)`.
   - Clamp the `sizeInBytes` returned by the unary-node helper with 
`.min(MAX_SIZE_IN_BYTES)`.
   - Clamp the `sizeInBytes` returned by `visitExpand` with 
`.min(MAX_SIZE_IN_BYTES)`.
   - Add a regression test in `BasicStatsEstimationSuite` that drives 30 
self-joins of a leaf node whose size is already `Long.MaxValue` and asserts the 
joined and projected sizes stay at or below the cap.
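
The idea behind the regression test can be modeled with plain `BigInt` arithmetic. The sketch below is an assumption-laden simplification, not the test added to `BasicStatsEstimationSuite`: `SelfJoinSketch`, `selfJoin`, and `repeatedSelfJoin` are hypothetical names, and a self-join is approximated as squaring the current size.

```scala
// Hypothetical model of the self-join chain; not the suite's actual test.
object SelfJoinSketch {
  val MaxSizeInBytes: BigInt = BigInt(Long.MaxValue)

  // Approximates a self-join estimate: size * size, clamped to the cap.
  def selfJoin(size: BigInt): BigInt =
    (size * size).min(MaxSizeInBytes)

  // Applies the clamped self-join estimate `rounds` times in a row.
  def repeatedSelfJoin(leafSize: BigInt, rounds: Int): BigInt =
    (1 to rounds).foldLeft(leafSize)((size, _) => selfJoin(size))
}

object SelfJoinSketchDemo extends App {
  val result = SelfJoinSketch.repeatedSelfJoin(BigInt(Long.MaxValue), 30)
  // The clamp keeps every intermediate value at or below the cap.
  assert(result <= SelfJoinSketch.MaxSizeInBytes)
}
```

Because the clamp is applied after every step, each intermediate value stays within 63 bits. Without it, the bit length would roughly double per round and leave `BigInteger`'s supported range well before round 30.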
   
   JIRA: https://issues.apache.org/jira/browse/SPARK-52163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

