Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19904
Thanks for looking into this @WeichenXu123. This does change the behavior
in a couple of ways though. Like @sethah said, the unpersist of training data is
no longer async, but this also changes the order in which `fit` and
`evaluate` are called, so that training data is not unpersisted until all but
the last model have also been evaluated. Before, all `modelFutures` would be
executed before any `metricFutures`, so training data could be unpersisted
as soon as possible. I believe this is also how it worked before the
parallelism was added.
I did some local testing where I put `modelFutures` in an inner function so
that they go out of scope before `awaitResult` is called, and also mapped the
`Future.sequence` similar to
https://github.com/apache/spark/pull/19904#discussion_r156751569. This
seemed to be enough to allow the models to be GC'd. I think this approach
would be a little better.
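
A rough sketch of that shape, using plain `scala.concurrent` rather than the actual Spark code. `Model`, `fit`, `evaluate`, `buildFutures` and the `released` flag are hypothetical stand-ins (the flag takes the place of unpersisting the training data), so treat this as an illustration of the scoping idea only, not the change proposed in the PR:

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ScopedFuturesSketch {
  // Hypothetical stand-ins for a fitted model, Estimator.fit and Evaluator.evaluate.
  final case class Model(paramIndex: Int)
  def fit(paramIndex: Int): Model = Model(paramIndex)
  def evaluate(model: Model): Double = model.paramIndex.toDouble

  def run(numParams: Int)(implicit ec: ExecutionContext): (Seq[Double], Boolean) = {
    // Assumption: this flag stands in for "training data has been unpersisted".
    val released = new AtomicBoolean(false)

    // modelFutures is local to this inner function, so once it returns the
    // caller only holds futures of Unit and Double, not futures of models.
    def buildFutures(): (Future[Unit], Seq[Future[Double]]) = {
      val modelFutures = (0 until numParams).map(i => Future(fit(i)))
      // Map over Future.sequence: runs once every fit has completed,
      // independently of how far evaluation has progressed.
      val releaseFuture = Future.sequence(modelFutures).map(_ => released.set(true))
      val metricFutures = modelFutures.map(_.map(evaluate))
      (releaseFuture, metricFutures)
    }

    val (releaseFuture, metricFutures) = buildFutures()
    // Blocking happens only after modelFutures has gone out of scope.
    val metrics = metricFutures.map(Await.result(_, Duration.Inf))
    Await.result(releaseFuture, Duration.Inf)
    (metrics, released.get)
  }
}
```

Calling `ScopedFuturesSketch.run(3)(ExecutionContext.global)` returns the three metrics together with the flag. The sketch does not demonstrate that the models are actually collected; it only shows that nothing held while awaiting results references them.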