Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21322
  
    according to `TorrentBroadcast.writeBlocks` and 
`TorrentBroadcast.readBroadcastBlock`, broadcasted data are always written into 
the block manager with `MEMORY_AND_DISK_SER` mode. When a task wants to get the 
broadcasted value, it needs to get the serialized bytes from block 
manager(maybe memory, maybe disk) and deserialize it into an object, put the 
object into a cache, and return the object to the user function.
    
    That said, at the executor side we may create the broadcasted object 
multiple times: e.g. if it's evicted from cache and GCed, we read and 
deserialize it from block manager again. A simple way to do executor side clean 
up is implementing `finalize` method in the class you want to broadcast.
    
    Alternatively, we can improve the cache and watch the eviction event. If 
eviction happens, do the cleanup.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to