GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19904

    [SPARK-22707][ML] Optimize CrossValidator fitting memory occupation by 
models

    ## What changes were proposed in this pull request?
    
    Via some test I found CrossValidator still exists memory issue, it will 
still occupy `O(n*sizeof(model))` memory for holding models when fitting, if 
well optimized, it should be `O(parallelism*sizeof(model))`
    
    This is because modelFutures will hold the reference to model object after 
future is complete (we can use `future.value.get.get` to fetch it), and the 
`Future.sequence` and the `modelFutures` array holds references to each model 
future. So all model object are keep referenced until `fit` return. So it will 
still occupy `O(n*sizeof(model))` memory.
    
    I fix this by merging the `modelFuture` and `foldMetricFuture` together, 
and via `wait/notify` to unpersist training dataset in time.
    
    ## How was this patch tested?
    
    N/A


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark 
fix_cross_validator_memory_issue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19904.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19904
    
----
commit 7725fd8a86dddba6c61c7d053dfa510a114bebb8
Author: WeichenXu <[email protected]>
Date:   2017-12-05T11:45:42Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to