Ngone51 opened a new pull request #25262: [SPARK-28486][CORE][PYTHON] Map 
PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC
URL: https://github.com/apache/spark/pull/25262
 
 
   ## What changes were proposed in this pull request?
   
   Currently, PythonBroadcast may delete its data file while a Python worker
   still needs it. This happens because PythonBroadcast overrides the
   `finalize()` method to delete its data file. So when GC runs and no
   references to the broadcast variable remain, `finalize()` may be triggered
   and delete the data file. It also means that the data behind a Python
   broadcast variable is not deleted when `unpersist()`/`destroy()` is called;
   deletion relies on GC instead.
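
   The failure mode can be illustrated with a small plain-Python analogue. This
   is only a sketch and not Spark code: `FileHolder` is a hypothetical stand-in
   for PythonBroadcast, and `__del__` plays the role of the overridden
   `finalize()`. Once the owner is collected, the file disappears even though
   another party still holds its path.

```python
import gc
import os
import tempfile


class FileHolder:
    """Hypothetical stand-in for PythonBroadcast: owns a data file on disk."""

    def __init__(self, payload: bytes):
        fd, self.path = tempfile.mkstemp(prefix="broadcast_")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)

    def __del__(self):
        # Analogue of the overridden finalize(): cleanup is tied to GC.
        if os.path.exists(self.path):
            os.remove(self.path)


holder = FileHolder(b"data")
path_given_to_worker = holder.path  # the "worker" only knows the path
del holder                          # the last reference to the owner goes away
gc.collect()                        # make sure the finalizer has run
file_survived = os.path.exists(path_given_to_worker)
```

   Here `file_survived` ends up false: the consumer of the path can no longer
   read the data, which mirrors the situation described above.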
   
   In this PR, we remove the `finalize()` method and map the PythonBroadcast
   data file to a BroadcastBlock (which has the same broadcast id as the
   broadcast variable that wraps this PythonBroadcast) while the
   PythonBroadcast is being deserialized. As a result, the data file can be
   deleted just like the other pieces of the broadcast variable when
   `unpersist()`/`destroy()` is called, and its deletion no longer relies on
   GC.
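
   The idea behind the fix can be sketched the same way, again as a plain-Python
   analogue rather than the actual Spark implementation. `BlockRegistry` and
   `FileHolder` are hypothetical names; the registry stands in for block
   tracking by broadcast id, and `remove_broadcast` stands in for what
   `unpersist()`/`destroy()` triggers. For brevity the file is registered in the
   constructor, whereas the PR performs the mapping during deserialization.

```python
import gc
import os
import tempfile


class BlockRegistry:
    """Hypothetical block tracker: maps a broadcast id to its data files."""

    def __init__(self):
        self._blocks = {}

    def register(self, broadcast_id, path):
        self._blocks.setdefault(broadcast_id, []).append(path)

    def remove_broadcast(self, broadcast_id):
        # Explicit cleanup: delete every file tracked under this broadcast id.
        removed = 0
        for path in self._blocks.pop(broadcast_id, []):
            if os.path.exists(path):
                os.remove(path)
                removed += 1
        return removed


class FileHolder:
    """Hypothetical PythonBroadcast analogue with no finalizer."""

    def __init__(self, broadcast_id, payload, registry):
        fd, self.path = tempfile.mkstemp(prefix="broadcast_")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # The file is now owned by the registry, not by this object's lifetime.
        registry.register(broadcast_id, self.path)


registry = BlockRegistry()
holder = FileHolder(0, b"data", registry)
path = holder.path
del holder
gc.collect()
survived_gc = os.path.exists(path)             # GC no longer deletes the file
removed_count = registry.remove_broadcast(0)   # explicit, unpersist-style cleanup
exists_after_removal = os.path.exists(path)
```

   In this sketch the file outlives garbage collection of its holder and is
   removed only by the explicit call, which is the behavior the PR aims for.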
   
   ## How was this patch tested?
    
   Added a Python test and tested manually (verified the creation and deletion
   of the broadcast block).

