Emil Ejbyfeldt created SPARK-40662:
--------------------------------------

             Summary: Serialization of MapStatuses is sometimes much larger on 
Scala 2.13
                 Key: SPARK-40662
                 URL: https://issues.apache.org/jira/browse/SPARK-40662
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Emil Ejbyfeldt


We have observed a case where the same job, when run against Spark on Scala 
2.13, fails with an out-of-memory error because the broadcast for the 
MapStatuses is huge.

In the logs around the time the job fails, it tries to create a broadcast of 
size 4.8 GiB:
```
2022-09-18 22:46:01,418 INFO memory.MemoryStore: Block broadcast_17 stored as 
values in memory (estimated size 4.8 GiB, free 12.9 GiB)
```

The broadcast of the MapStatuses for the same job running on Scala 2.12 is 
391.5 MiB:
```
2022-09-18 16:11:58,753 INFO memory.MemoryStore: Block broadcast_17 stored as 
values in memory (estimated size 391.5 MiB, free 26.4 GiB)
```

In this particular case the broadcast for the MapStatuses is more than 10 
times larger when using Scala 2.13. This is not universal for all MapStatus 
broadcasts, as we have many other jobs using Scala 2.13 where the status is 
roughly the same size as on 2.12.
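
As a quick check of the "more than 10 times" figure, the two estimated sizes 
from the log lines above can be compared directly. This is only a sketch: the 
helper name `size_ratio` is hypothetical and not part of Spark, and the inputs 
are the rounded estimates printed by MemoryStore, so the result is approximate.

```python
# Rough comparison of the two broadcast sizes reported in the logs.
# Inputs are the rounded "estimated size" values, so the ratio is approximate.

MIB_PER_GIB = 1024


def size_ratio(larger_gib: float, smaller_mib: float) -> float:
    """Return how many times larger `larger_gib` (GiB) is than `smaller_mib` (MiB)."""
    return larger_gib * MIB_PER_GIB / smaller_mib


# 4.8 GiB on Scala 2.13 versus 391.5 MiB on Scala 2.12
ratio = size_ratio(4.8, 391.5)
print(f"Scala 2.13 broadcast is about {ratio:.1f}x the Scala 2.12 size")
```

With these rounded values the ratio comes out at roughly 12.5.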

This has been observed on 3.3.0, but I also tested against 3.3.1-rc2 and a 
build of 3.4.0-SNAPSHOT, and both reproduced the issue.


