liupc opened a new pull request #27968: [SPARK-31202][CORE]Improve SizeEstimator for AppendOnlyMap
URL: https://github.com/apache/spark/pull/27968

### What changes were proposed in this pull request?
Spark's memory management currently depends on size estimation for both execution and storage. In our production cluster, users regularly hit OOM errors caused by inaccurate size estimation for `AppendOnlyMap`. Spark stores the key-value pairs of an `AppendOnlyMap` in a single `Array[AnyRef]` for memory locality, and for transformations such as cogroup/join/groupBy the values can be `CompactBuffer[_]` or `Array[CompactBuffer[_]]`. The current `SizeEstimator` still treats this special array as a normal array, so in many cases we observed a large gap between the estimated size and the actual memory consumption.

This PR improves the estimation for `AppendOnlyMap` when the value type is `CompactBuffer` or `Array[CompactBuffer]`.

### Why are the changes needed?
The estimation becomes more accurate, which can avoid OOM in many cases.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs and added UTs.
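
For readers unfamiliar with the layout mentioned under "What changes were proposed", here is a rough sketch of a flat array with keys and values interleaved, where each value is itself a buffer of varying length. It is written in Java for illustration only: `FlatKvLayout`, `interleave`, and `countBufferedElements` are hypothetical names, plain `List` stands in for `CompactBuffer`, and nothing here is Spark's `AppendOnlyMap` or `SizeEstimator` code, nor the change made in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Spark's AppendOnlyMap or SizeEstimator.
public class FlatKvLayout {

    // Build a flat array where slot 2*i holds key i and slot 2*i + 1 holds its value.
    public static Object[] interleave(String[] keys, List<List<Integer>> values) {
        Object[] data = new Object[2 * keys.length];
        for (int i = 0; i < keys.length; i++) {
            data[2 * i] = keys[i];
            data[2 * i + 1] = values.get(i);
        }
        return data;
    }

    // Count the elements held by the value buffers, visiting odd slots only.
    public static long countBufferedElements(Object[] data) {
        long total = 0;
        for (int i = 1; i < data.length; i += 2) {
            if (data[i] instanceof List<?>) {
                total += ((List<?>) data[i]).size();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c"};
        List<List<Integer>> values = new ArrayList<>();
        values.add(Arrays.asList(1));
        values.add(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        values.add(Arrays.asList(1, 2));

        Object[] data = interleave(keys, values);
        System.out.println("slots = " + data.length);
        System.out.println("buffered elements = " + countBufferedElements(data));
    }
}
```

The point of the sketch is only that the array mixes small keys with value buffers whose sizes can differ widely, so an estimator that treats every slot as an ordinary element may misjudge the total.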
