Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/11119#discussion_r82215606
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
@Since("1.5.0")
def setSeed(value: Long): this.type = set(seed, value)
+ /** @group setParam */
+ @Since("2.1.0")
+ def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
--- End diff --
There was some discussion on this earlier in this PR (it was in March :). If the
above is the desired behavior, we still need to check that `k` and the initial
model line up, since you can set the initial model and then set `k`. I tested
it and an error still gets thrown, but it's thrown by the mllib KMeans instead.
We should check it in ML explicitly. I prefer the following behavior:

* If `isSet(initialModel) && isSet(k)`, then check that they are equal at
  train time and throw an error if not.
* If `isSet(initialModel) && !isSet(k)`, then set `k` to the initial model's
  `k` at train time (can log a warning maybe).

Actually, the current behavior is essentially equivalent. But we still
need a test to check that an error is thrown when the two mismatch, and we
still need to check that case inside the train method.
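
To make the two cases concrete, here is a rough, Spark-free sketch of the check I have in mind. `InitialModelCheck`, `resolveK`, and the `Option` arguments are hypothetical stand-ins for `isSet(k)` / `isSet(initialModel)`; this is not the actual ML API, just the logic:

```scala
// Hypothetical helper, not part of Spark: models the params as Options,
// where None means the param was never set by the user.
object InitialModelCheck {

  /**
   * Decide which k to train with.
   *
   * @param kParam        Some(k) if k was set explicitly, None otherwise
   * @param initialModelK Some(number of cluster centers) if an initial model was set
   * @param defaultK      k to fall back on when neither is set
   */
  def resolveK(kParam: Option[Int], initialModelK: Option[Int], defaultK: Int): Int =
    (kParam, initialModelK) match {
      case (Some(k), Some(modelK)) =>
        // Both set: they must agree, otherwise fail here rather than in mllib.
        require(k == modelK,
          s"k ($k) does not match the number of cluster centers " +
            s"in the initial model ($modelK)")
        k
      case (None, Some(modelK)) =>
        // Only the initial model is set: take k from it (a warning could be logged here).
        modelK
      case (Some(k), None) => k
      case (None, None)    => defaultK
    }
}
```

In the real code this would run at the start of the train method, and the mismatch case is what the new test should cover.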