GitHub user srowen commented on the issue:
https://github.com/apache/spark/pull/14937
@yanboliang here are a few other changes I made in my PR that accidentally
duplicated some of this work. Refer to
https://github.com/apache/spark/pull/14948 for details. For your consideration:
I think getRuns/setRuns should be formally deprecated and the runs param to
the constructor removed (it's private).
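For illustration, a minimal sketch of what formal deprecation could look like. The class name `KMeansSketch`, the messages, and the "x.y.z" version string are placeholders, not taken from either PR:

```scala
// Hypothetical stand-in for the real class; only the deprecation pattern matters here.
class KMeansSketch {
  // Kept for source compatibility; the value is fixed because multiple runs no longer apply.
  @deprecated("This has no effect and always returns 1", "x.y.z")
  def getRuns: Int = 1

  // Accepts and ignores the argument, returning this so call chains keep compiling.
  @deprecated("This has no effect", "x.y.z")
  def setRuns(runs: Int): this.type = this
}
```

Callers still compile, but get a deprecation warning at each use site.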
There are some mentions of 'runs' in the docs that should be removed too at
this point.
mergeContribs and the "type WeightedPoint" don't really serve a purpose at
this point and can be 'inlined' IMHO.
Minor: the "contribs.iterator" can really be an iterator only over triples
with non-zero counts, which eliminates the filtering by 0 counts.
The "run finished" log message is obsolete now.
Minor, but in k-means|| the sample of 1 element is very slightly better if
it's without replacement. Won't matter much but otherwise you might sample a
couple elements.
pointsWithCosts.flatMap might be a little faster as filter + map instead
because virtually every element is filtered out.
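A rough sketch of the two shapes on plain Scala collections rather than an RDD. The data and the `keep` rule are made up for illustration; the real code selects points at random in proportion to their cost:

```scala
object SparseSelectSketch {
  // Hypothetical (point, cost) pairs standing in for pointsWithCosts.
  val pointsWithCosts: Seq[(String, Double)] =
    Seq(("a", 0.01), ("b", 0.90), ("c", 0.02), ("d", 0.03))

  // Hypothetical selection rule; an assumption, not the rule used in the PR.
  def keep(cost: Double): Boolean = cost > 0.5

  // flatMap form: builds an Option for every element, including the many rejected ones.
  val viaFlatMap: Seq[String] =
    pointsWithCosts.flatMap { case (p, c) => if (keep(c)) Some(p) else None }

  // filter + map form: rejects first, then transforms only the few survivors.
  val viaFilterMap: Seq[String] =
    pointsWithCosts.filter { case (_, c) => keep(c) }.map(_._1)
}
```

Both forms give the same result; whether the second is measurably faster would need benchmarking.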
mergeNewCenters() is pretty superfluous, because it's simpler to compute
newCenters, then add it to centers, in the same loop. That way there's no
clear() and no repeated calls to update it.
weightMap can be computed with countByValue directly.
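As a sketch of the idea on plain Scala collections: the indices are made up, and the local `groupBy` only mimics what `RDD.countByValue()` returns (a `Map[T, Long]`), so this is an analogue rather than the actual Spark code:

```scala
object WeightMapSketch {
  // Hypothetical index of the closest center for each point.
  val bestCenter: Seq[Int] = Seq(0, 2, 2, 1, 2, 0)

  // Pair every index with 1.0, then sum per key (the map + reduce-by-key shape).
  val viaPairs: Map[Int, Double] =
    bestCenter.map(i => (i, 1.0)).groupBy(_._1).map { case (i, ps) => i -> ps.map(_._2).sum }

  // Count occurrences of each value directly; local analogue of countByValue.
  val viaCount: Map[Int, Long] =
    bestCenter.groupBy(identity).map { case (i, is) => i -> is.size.toLong }
}
```

The counts come back as Long rather than Double, so downstream code that expects weights as Double would need a conversion.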