GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/21502

    [SPARK-22575][SQL] Add destroy to Dataset

    ## What changes were proposed in this pull request?
    
    In the Dataset API we may acquire resources that we cannot explicitly 
deallocate. This happens with broadcast joins: the broadcast object is never 
destroyed, and we rely on garbage collection of the broadcast object to free 
it. In the general case this is a safe assumption, but when dynamic allocation 
is enabled, the current approach can leak resources.
    
    In particular, when a Spark application is submitted on YARN with dynamic 
allocation enabled, we may leak disk space. In such a scenario, when a query 
with a broadcast join is executed, we are likely to request new containers. 
These containers are used to execute the query and are then killed, possibly 
before the broadcast object is GCed. In that case, the files that have been 
written are never removed: the container is no longer alive to remove them, 
and YARN removes them only when the application ends.
    
    To solve this issue, the PR adds a `destroy` method to the `Dataset` 
class, which can be used to free all the resources acquired during plan 
execution. Because the acquired resources are destroyed eagerly, they are freed 
before the containers are killed, which avoids (or at least considerably 
reduces) the problem.
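
    The ordering problem described above can be illustrated with a small toy 
model. This is only a sketch and is not Spark or YARN code: `Container`, 
`store_broadcast`, `remove_broadcast`, `leftover_files` and `run_query` are 
hypothetical names made up for the illustration, and a temporary directory 
stands in for the application's local directories. It shows that cleanup 
which runs only after the container has been killed leaves the file behind, 
while cleanup requested eagerly (the role the proposed `destroy` would play) 
removes it.

```python
import os
import shutil
import tempfile


class Container:
    """Toy stand-in for a container that keeps broadcast blocks on local disk."""

    def __init__(self, app_dir, name):
        self.dir = os.path.join(app_dir, name)
        os.makedirs(self.dir)
        self.alive = True

    def store_broadcast(self, broadcast_id):
        # Write a placeholder file representing a broadcast block on disk.
        path = os.path.join(self.dir, f"broadcast_{broadcast_id}")
        with open(path, "w") as f:
            f.write("payload")

    def remove_broadcast(self, broadcast_id):
        # Only a live container can clean up its own files.
        if not self.alive:
            return False
        path = os.path.join(self.dir, f"broadcast_{broadcast_id}")
        if os.path.exists(path):
            os.remove(path)
        return True

    def kill(self):
        # The container is released; whatever it wrote stays on disk.
        self.alive = False


def leftover_files(app_dir):
    """Return the names of files still on disk under the application directory."""
    return sorted(
        name
        for _, _, files in os.walk(app_dir)
        for name in files
    )


def run_query(eager_destroy):
    """Simulate one broadcast-join query; return the files left behind."""
    app_dir = tempfile.mkdtemp(prefix="toy_app_")
    try:
        container = Container(app_dir, "container_1")
        container.store_broadcast(0)
        if eager_destroy:
            # Hypothetical eager destroy: clean up while the container is alive.
            container.remove_broadcast(0)
        container.kill()
        # Late, GC-driven cleanup: has no effect once the container is gone.
        container.remove_broadcast(0)
        return leftover_files(app_dir)
    finally:
        # Stands in for everything being removed when the application ends.
        shutil.rmtree(app_dir)
```

    In this model, `run_query(eager_destroy=False)` reports the broadcast file 
as still present after the container is killed, whereas 
`run_query(eager_destroy=True)` reports no leftover files.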
    
    ## How was this patch tested?
    
    Added unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22575

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21502
    
----
commit 147bd08db09fe328de12069c9c0d8a849d99adf4
Author: Marco Gaido <marcogaido91@...>
Date:   2018-01-31T16:35:37Z

    [SPARK-22575][SQL] Add destroy to Dataset

----

