[GitHub] spark pull request #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with be...

srowen Sat, 10 Sep 2016 06:13:12 -0700

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14956#discussion_r78277099

--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
---
@@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
val points = v.mapValues(x => Vectors.dense(x)).cache()
val model = new KMeans()
.setK(k)
- .setSeed(0L)
+ .setSeed(5L)
--- End diff --

Sorry Matei I kind of missed your point. Yes it's a bit more strange to be
changing a seed in a non-test file. Same reasoning, but I agree. There's a seed
here to begin with for determinism but probably shouldn't matter.

I think I understand the problem, and I think it's the test. k-means is
used to cluster 1D data. The test case generates two concentric circles of
points of radius 1 and 4, which are intended to form k=2 separate clusters in
the derived values that are clustered by k-means internally. That's even clear
from looking at the similarities plotted:

![rplot](https://cloud.githubusercontent.com/assets/822522/18410522/2de05f76-775d-11e6-8e75-07d3d31e5cae.png)

While it's clear what the clustering is supposed to be, it's not actually
the lowest-cost k-means clustering. Many clusterings do find the 'wrong' better
clustering which is one that would include a few of the leftmost elements of
the right group into the left one. Many other clusterings get the 'right'
answer which is a big local minimum but not optimal. In fact, k-means|| init
seems to do worse than random here exactly because it's less likely to find the
local minimum.

I don't think the choice of radii matters here, since the resulting values
above are basically invariant.

I'm going to have to read the paper more to understand what the difference
is here. It's not quite the k-means change here, and, we can make this test
pass easily by either

- Set seed back to 0 and fix init steps = 5 for this use of k-means,
because that happens to work. Then this implementation doesn't change at all.
It means it does more work just to make the test pass.
- Set seed to, say, 5 to get this to pass, on the theory that the choice of
seed still doesn't seem to matter per se, and 5 is no worse than 0.

Obviously I want to understand a little more about how this is ever
supposed to work in PIC, though it ends up being a slightly different issue.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with be...

Reply via email to