Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21322
according to `TorrentBroadcast.writeBlocks` and
`TorrentBroadcast.readBroadcastBlock`, broadcasted data are always written into
the block manager with `MEMORY_AND_DISK_SER` mode. When a task wants to get the
broadcasted value, it needs to get the serialized bytes from block
manager(maybe memory, maybe disk) and deserialize it into an object, put the
object into a cache, and return the object to the user function.
That said, at the executor side we may create the broadcasted object
multiple times: e.g. if it's evicted from cache and GCed, we read and
deserialize it from block manager again. A simple way to do executor side clean
up is implementing `finalize` method in the class you want to broadcast.
Alternatively, we can improve the cache and watch the eviction event. If
eviction happens, do the cleanup.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]