[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

BryanCutler Wed, 13 Dec 2017 14:09:06 -0800

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19904
  
    Thanks for looking into this, @WeichenXu123. This does change the behavior 
in a couple of ways, though.  As @sethah said, the unpersist of the training 
data is no longer async, but this also changes the order in which `fit` and 
`evaluate` are called, so the training data is not unpersisted until all but 
the last model have also been evaluated.  Before, all `modelFutures` would be 
executed before `metricFutures`, so the training data could be unpersisted 
as soon as possible.  I believe this is also how it worked before the 
parallelism was added.
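
    The earlier ordering can be sketched roughly as below. This is a simplified illustration with plain Scala futures, not the actual `CrossValidator` code: `fitModel`, `evaluate`, and the two counters are hypothetical stand-ins for `Estimator.fit`, `Evaluator.evaluate`, and the call that unpersists the training data. The sketch also waits on the unpersist step so that the result is deterministic, which the real code would not need to do.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object FitThenEvaluate {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-ins for Estimator.fit and Evaluator.evaluate.
  def fitModel(param: Int): String = s"model-$param"
  def evaluate(model: String): Double = model.length.toDouble

  // Number of completed fits, and the value of that count when "unpersist" ran.
  val fitted = new AtomicInteger(0)
  val fittedAtUnpersist = new AtomicInteger(-1)

  def run(params: Seq[Int]): Seq[Double] = {
    // Submit every fit first.
    val modelFutures = params.map { p =>
      Future { val m = fitModel(p); fitted.incrementAndGet(); m }
    }
    // Stand-in for unpersisting the training data: runs asynchronously,
    // as soon as all fits are done, without waiting for any evaluation.
    val unpersisted = Future.sequence(modelFutures).map { _ =>
      fittedAtUnpersist.set(fitted.get())
    }
    // Each evaluation is chained onto its own model future.
    val metricFutures = modelFutures.map(_.map(evaluate))
    Await.result(unpersisted, Duration.Inf) // only for a deterministic sketch
    metricFutures.map(Await.result(_, Duration.Inf))
  }
}
```

    Calling `FitThenEvaluate.run(Seq(1, 2, 3))` returns one metric per parameter, and `fittedAtUnpersist` shows that the unpersist step ran only after every fit had finished.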
    
    I did some local testing where I put `modelFutures` in an inner function so 
that they are out of scope before `awaitResult` is called, and also mapped the 
`Future.sequence` similar to 
https://github.com/apache/spark/pull/19904#discussion_r156751569, and this 
seemed to be enough to allow the models to be GC'd.  I think this approach 
would be a little better.
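
    A minimal sketch of that idea, under stated assumptions: `Model`, `fitModel`, `evaluate`, and `unpersistTraining` are hypothetical placeholders rather than Spark APIs, and "mapping the `Future.sequence`" is read here as mapping its result to `Unit` so the combined future does not keep the list of models alive. It shows only the scoping, not the real `CrossValidator` implementation, and whether the models are actually collected depends on the JVM.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ScopedModelFutures {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-ins for a fitted model, Estimator.fit and Evaluator.evaluate.
  final case class Model(id: Int, weights: Array[Double])
  def fitModel(param: Int): Model = Model(param, Array.fill(1024)(param.toDouble))
  def evaluate(model: Model): Double = model.weights.sum / model.weights.length

  // modelFutures lives only inside this inner function, so the caller never
  // holds a reference to the model futures, only to the metric futures.
  private def metricFutures(
      params: Seq[Int],
      unpersistTraining: () => Unit): Seq[Future[Double]] = {
    val modelFutures = params.map(p => Future(fitModel(p)))
    // Map the sequenced result to Unit so it does not retain the models,
    // then unpersist the training data asynchronously once all fits finish.
    Future.sequence(modelFutures).map(_ => ()).onComplete(_ => unpersistTraining())
    modelFutures.map(_.map(evaluate))
  }

  def foldMetrics(params: Seq[Int], unpersistTraining: () => Unit): Seq[Double] =
    // By the time results are awaited, modelFutures is already out of scope.
    metricFutures(params, unpersistTraining).map(Await.result(_, Duration.Inf))
}
```

    For example, `ScopedModelFutures.foldMetrics(Seq(1, 2, 3), () => ())` waits only on the metric futures and returns one metric per parameter.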


