GitHub user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19208#discussion_r148701451
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
    @@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
         instr.logParams(numFolds, seed, parallelism)
         logTuningParams(instr)
     
    +    val collectSubModelsParam = $(collectSubModels)
    +
    +    var subModels: Option[Array[Array[Model[_]]]] = if (collectSubModelsParam) {
    --- End diff --
    
    @holdenk @jkbradley I have already considered this issue. The reasons I
    did it this way are:
    1) When `$(collectSubModels) == false`, `modelFutures` and
    `foldMetricFutures` are executed in a pipelined way. This ensures that the
    `model` generated in `modelFutures` is released promptly, so the maximum
    memory cost is `numParallelism * sizeof(model)`. If we instead use the
    "collecting modelFutures" approach, the memory cost increases to
    `$(estimatorParamMaps).length * sizeof(model)`. This is a serious issue
    that has been discussed before.
    2) IMO the mutation on L145 won't affect performance, and it does not need
    anything like a lock, because there is no race condition.
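
A minimal sketch of the two points above, not the code in this PR. `ToyModel`, `fit`, `evaluate`, and `run` are hypothetical stand-ins for the estimator, evaluator, and training loop, and the sketch assumes each task writes only to its own array slot (which is why no lock would be needed). It also simplifies the separate `modelFutures` and `foldMetricFutures` into a single future per parameter map.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object PipelinedFitSketch {
  // Hypothetical stand-in for a fitted model; not a Spark class.
  final case class ToyModel(paramIndex: Int, weights: Array[Double])

  // Hypothetical stand-ins for Estimator.fit and Evaluator.evaluate.
  def fit(paramIndex: Int): ToyModel =
    ToyModel(paramIndex, Array.fill(1000)(paramIndex.toDouble))

  def evaluate(model: ToyModel): Double =
    model.weights.sum / model.weights.length

  def run(numParamMaps: Int, parallelism: Int, collectSubModels: Boolean): (Array[Double], Option[Array[ToyModel]]) = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Allocated only when the caller asks to keep the sub-models.
      val subModels: Option[Array[ToyModel]] =
        if (collectSubModels) Some(new Array[ToyModel](numParamMaps)) else None

      val metricFutures = (0 until numParamMaps).map { i =>
        Future {
          val model = fit(i)
          // Each task writes only slot i, so tasks never touch the same element.
          subModels.foreach(arr => arr(i) = model)
          // Fit and evaluate run back to back in the same task; when nothing is
          // collected, the model becomes unreachable once the metric is computed.
          evaluate(model)
        }
      }

      // Awaiting every future before reading the array makes the writes visible here.
      val metrics = metricFutures.map(f => Await.result(f, Duration.Inf)).toArray
      (metrics, subModels)
    } finally {
      pool.shutdown()
    }
  }
}
```

With `collectSubModels = false`, at most `parallelism` models are alive at once in this sketch, since the pool runs that many tasks concurrently and each task drops its model after evaluation. With `collectSubModels = true`, all `numParamMaps` models are retained in the array by design.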


