srowen commented on a change in pull request #22764: [SPARK-25765][ML] Add
training cost to BisectingKMeans summary
URL: https://github.com/apache/spark/pull/22764#discussion_r243301364
##########
File path:
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala
##########
@@ -195,7 +200,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] {
val data = rows.select("index", "size", "center", "norm", "cost", "height", "children")
val nodes = data.rdd.map(Data.apply).collect().map(d => (d.index, d)).toMap
val rootNode = buildTree(rootId, nodes)
- new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN)
+ new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN, 0.0)
Review comment:
Wouldn't the training cost here just be `rootNode.leafNodes.map(_.cost).sum`? If
that cost info is present in the nodes (?), computing it doesn't need a pass over
the data (which indeed doesn't exist at this point). If the value is worth
including at all, shouldn't it be correct wherever it is in fact available,
rather than hard-coded to 0.0?
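
To make the suggestion concrete, here is a minimal sketch of summing leaf costs over a loaded tree. `Node`, `LeafCost`, and `example` are hypothetical stand-ins, not Spark's `ClusteringTreeNode`; the sketch also assumes that each leaf's stored `cost` is the quantity the summary should report, which is exactly the open question above.

```scala
// Hypothetical stand-in for a clustering tree node; only the fields needed here.
final case class Node(index: Int, cost: Double, children: Seq[Node] = Seq.empty) {
  def isLeaf: Boolean = children.isEmpty

  // All leaves under this node, collected depth-first.
  def leafNodes: Seq[Node] =
    if (isLeaf) Seq(this) else children.flatMap(_.leafNodes)
}

object LeafCost {
  // Sum the cost stored on each leaf; no pass over the training data is needed.
  def trainingCost(root: Node): Double = root.leafNodes.map(_.cost).sum

  // Small illustrative tree: leaves are nodes 4, 5, and 3.
  val example: Node = Node(1, 10.0, Seq(
    Node(2, 3.0, Seq(Node(4, 1.0), Node(5, 1.5))),
    Node(3, 2.0)))

  def main(args: Array[String]): Unit =
    println(trainingCost(example))
}
```

If that assumption holds, the loader could pass the computed sum instead of the literal `0.0`.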