[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...

mgaido91 Fri, 19 Oct 2018 07:31:22 -0700

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22764#discussion_r226668414
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala
 ---
    @@ -225,13 +227,14 @@ object BisectingKMeansModel extends 
Loader[BisectingKMeansModel] {
           assert(formatVersion == thisFormatVersion)
           val rootId = (metadata \ "rootId").extract[Int]
           val distanceMeasure = (metadata \ "distanceMeasure").extract[String]
    +      val trainingCost = (metadata \ "trainingCost").extract[Double]
    --- End diff --
    
    Thank you all for the comments and sorry for the late answer. Just a couple 
of notes on your comments @WeichenXu123  (I may be missing something, so please 
correct me if I am wrong):
    
     - I checked the `ml.clustering.KMeansModel.KMeansModelReader` and it 
doesn't store anything related to the summary. Summary is not recovered after 
load of the model, so I don't see any reason why we should modify the read/load 
of `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`;
     - this model can read from previous versions, since this is version "2.0", 
which was introduced for Spark 2.4; for previous versions, we read/write 
version "1.0"; the version check method for versioning is used only for the 
`ml` package, not in `mllib` where we have this versioning approach;
    
    > I was told that this change just follows what we did for other models 
before.
    
    @cloud-fan Yes, let me link the PR for KMeans doing the same, which is: 
https://github.com/apache/spark/pull/20629.
    
    Just a final comment which I hope clarifies which is the source of the 
confusion here and the reason of the above comments by @viirya and 
@WeichenXu123: `trainingCost` here is a member of the `summary`, not a 
parameter of the model for the `ml.clustering.BisectingKMeansModel`. Instead, 
it is a member of the model for `mllib.clustering.BisectingKMeansModel` (we 
have no summary notion there...). So storing it for `mllib` is needed in order 
for the model read after persisting it being the same of the original one (I 
think it doesn't pass UTs otherwise). Storing it for the `ml`, instead, it is 
not needed, because the summary is not persisted. If we want to persist also 
the summary for `ml` package I think we should best create a separate JIRA and 
PR for it.
    
    Hope this clarifies (sorry for being so verbose).



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...

Reply via email to