Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22764#discussion_r226668414
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala
---
@@ -225,13 +227,14 @@ object BisectingKMeansModel extends
Loader[BisectingKMeansModel] {
assert(formatVersion == thisFormatVersion)
val rootId = (metadata \ "rootId").extract[Int]
val distanceMeasure = (metadata \ "distanceMeasure").extract[String]
+ val trainingCost = (metadata \ "trainingCost").extract[Double]
--- End diff --
Thank you all for the comments and sorry for the late answer. Just a couple
of notes on your comments @WeichenXu123 (I may be missing something, so please
correct me if I am wrong):
- I checked the `ml.clustering.KMeansModel.KMeansModelReader` and it
doesn't store anything related to the summary. Summary is not recovered after
load of the model, so I don't see any reason why we should modify the read/load
of `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`;
- this model can read from previous versions, since this is version "2.0",
which was introduced for Spark 2.4; for previous versions, we read/write
version "1.0"; the version check method for versioning is used only for the
`ml` package, not in `mllib` where we have this versioning approach;
> I was told that this change just follows what we did for other models
before.
@cloud-fan Yes, let me link the PR for KMeans doing the same, which is:
https://github.com/apache/spark/pull/20629.
Just a final comment which I hope clarifies which is the source of the
confusion here and the reason of the above comments by @viirya and
@WeichenXu123: `trainingCost` here is a member of the `summary`, not a
parameter of the model for the `ml.clustering.BisectingKMeansModel`. Instead,
it is a member of the model for `mllib.clustering.BisectingKMeansModel` (we
have no summary notion there...). So storing it for `mllib` is needed in order
for the model read after persisting it being the same of the original one (I
think it doesn't pass UTs otherwise). Storing it for the `ml`, instead, it is
not needed, because the summary is not persisted. If we want to persist also
the summary for `ml` package I think we should best create a separate JIRA and
PR for it.
Hope this clarifies (sorry for being so verbose).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]