[
https://issues.apache.org/jira/browse/SPARK-40662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emil Ejbyfeldt resolved SPARK-40662.
------------------------------------
Resolution: Invalid
The increase was caused by a change in hashCode between Scala 2.12 and 2.13:
reading data written with a different Scala version shifted keys across
partitions, producing far more non-empty blocks to fetch.
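A minimal sketch of the mechanism, under stated assumptions: Spark's HashPartitioner assigns a key to a shuffle partition via `nonNegativeMod(key.hashCode, numPartitions)` (the helper below mirrors `org.apache.spark.util.Utils.nonNegativeMod`). The `Key` case class is hypothetical; the point is that Scala 2.13 changed case-class hashing (the class name now seeds the hash), so the same value can land in a different partition than it did on 2.12, turning previously empty shuffle blocks non-empty.

```scala
// Mirrors the modulo used by Spark's HashPartitioner to map a
// key's hashCode onto a non-negative partition index.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

// Hypothetical key type for illustration. Because the partition depends
// entirely on hashCode, a hashCode that differs between Scala 2.12 and
// 2.13 moves the key to a different partition, changing which shuffle
// blocks are empty and inflating the MapStatus broadcast.
case class Key(id: Long)

val numPartitions = 200
val partition = nonNegativeMod(Key(42L).hashCode, numPartitions)
println(s"Key(42) -> partition $partition of $numPartitions")
```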
> Serialization of MapStatuses is sometimes much larger on Scala 2.13
> -------------------------------------------------------------------
>
> Key: SPARK-40662
> URL: https://issues.apache.org/jira/browse/SPARK-40662
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Emil Ejbyfeldt
> Priority: Major
>
> We have observed a case where the same job run against Spark on Scala 2.13
> fails with an out-of-memory error because the broadcast for the MapStatuses
> is huge.
> In the logs around the time the job fails it tries to create a broadcast of
> size 4.8GiB.
> ```
> 2022-09-18 22:46:01,418 INFO memory.MemoryStore: Block broadcast_17 stored as
> values in memory (estimated size 4.8 GiB, free 12.9 GiB)
> ```
> The same broadcast of the MapStatuses for the same job running on 2.12 is
> only 391.5 MiB:
> ```
> 2022-09-18 16:11:58,753 INFO memory.MemoryStore: Block broadcast_17 stored as
> values in memory (estimated size 391.5 MiB, free 26.4 GiB)
> ```
> In this particular case the broadcast for the MapStatuses is more than 10
> times larger when using 2.13. This is not universal for all MapStatus
> broadcasts, as we have many other jobs using Scala 2.13 where the status
> is roughly the same size.
> This has been observed on 3.3.0, but I also tested it against 3.3.1-rc2 and
> a build of 3.4.0-SNAPSHOT, and both of those also reproduced the issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)