Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22764#discussion_r226820605
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala
 ---
    @@ -225,13 +227,14 @@ object BisectingKMeansModel extends 
Loader[BisectingKMeansModel] {
           assert(formatVersion == thisFormatVersion)
           val rootId = (metadata \ "rootId").extract[Int]
           val distanceMeasure = (metadata \ "distanceMeasure").extract[String]
    +      val trainingCost = (metadata \ "trainingCost").extract[Double]
    --- End diff --
    
    @WeichenXu123 I don't see (and think) this change breaks backwards 
compatibility for `mllib`. You can still load a model stored by previous 
versions, the only difference is that it won't have a valid `trainingCost` (but 
this just means we lack a new feature, all operations/data available before is 
still there, so backward compatibility is ensured). The same approach was used 
also for `KMeans`, so if we want to change approach here, we might have to do 
it also for `KMeans`.
    
    Anyway, regarding your suggestion, since the summary is not recomputed, 
there is no need at all to store this for `ml`: a loaded model doesn't contain 
the summary, so doesn't contain the `trainingCost` (same for `KMeans`). We 
could/should rather ignore at all storing the value, it changes nothing on `ml` 
side.
    
    On the other side, if we want it to be stored and re-loaded on `ml` side, 
we should start persisting the summary too there, but as I mentioned before, I 
think this is a different and bigger change, which should involve all the 
summary/models and not only `BisectingKMeans`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to