GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10539#issuecomment-169091413
@ankurdave Would you mind checking out this issue? It's really weird to
me. Here's what's happening:
* Background: PowerIterationClustering (PIC) initializes a
```Graph[Double, Double]``` which represents a (symmetric) similarity matrix.
It then runs power iteration on the graph (i.e., on the matrix) to approximate
the first (dominant) eigenvector. Finally, it clusters the vertices with
KMeans, using each vertex's single Double entry in that eigenvector.
* On the surface: This PR changes KMeans in PIC to run 1 time instead of 5
times. That changes the results of PIC.
* Test failure: There are 2 comparable tests in
PowerIterationClusteringSuite: ```test("power iteration clustering")``` and
```test("power iteration clustering on graph")```. The second one is failing,
even though it seems to begin with exactly the same graph and values.
* If you look more closely, it seems to be an issue with partitioning:
    * Same initialization: One test gives PIC a list of similarities (which
      PIC converts to a graph), and the other gives PIC the graph. However, if
      you print the graphs PIC begins with, they are exactly the same, except
      for the partitioning.
    * The PIC runs in the 2 tests diverge on the first iteration.
* The 2 "fixes" I've seen are:
    * Changing ```TripletFields.Src``` to ```TripletFields.All``` somehow
      fixes the problem, even though only the edge and src attributes are
      used. (done in this PR)
    * Changing the number of partitions from 2 to 1 also fixes the problem.
      (tested locally)
Is it possible that GraphX is not shipping the correct data between
partitions? Thanks a lot in advance for your help!
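
In case it helps narrow things down, here is a minimal sketch of the kind of check described above: compute the same edge values with ```TripletFields.Src``` and ```TripletFields.All```, on 1 and on 2 partitions, and compare. Everything here is an assumption for illustration (the toy similarities, the ```rowNormalized``` helper, and the row-normalization step are hypothetical stand-ins, not the actual PIC code or the test-suite data), so it may or may not reproduce the divergence.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields}

object TripletFieldsCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TripletFieldsCheck")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical toy similarities, stored in both directions so the matrix is symmetric.
      val sims = Seq((0L, 1L, 1.0), (1L, 2L, 0.5), (0L, 2L, 0.1))
      val edgeList = sims.flatMap { case (i, j, s) => Seq(Edge(i, j, s), Edge(j, i, s)) }

      // Illustrative helper (not from PIC): divide each edge weight by its source row sum.
      def rowNormalized(numPartitions: Int, fields: TripletFields): Seq[(Long, Long, Double)] = {
        val g = Graph.fromEdges(sc.parallelize(edgeList, numPartitions), 0.0)
        // Row sums need only the edge attribute.
        val rowSums = g.aggregateMessages[Double](
          ctx => ctx.sendToSrc(ctx.attr), _ + _, TripletFields.EdgeOnly)
        // The map function reads only the edge and source attributes.
        Graph(rowSums, g.edges)
          .mapTriplets(t => t.attr / t.srcAttr, fields)
          .edges
          .map(e => (e.srcId, e.dstId, e.attr))
          .collect()
          .sortBy(e => (e._1, e._2))
          .toSeq
      }

      for (numPartitions <- Seq(1, 2)) {
        val withSrc = rowNormalized(numPartitions, TripletFields.Src)
        val withAll = rowNormalized(numPartitions, TripletFields.All)
        println(s"partitions=$numPartitions, Src and All agree: ${withSrc == withAll}")
        if (withSrc != withAll) {
          // Print the edges whose values differ between the two settings.
          withSrc.zip(withAll).filter { case (a, b) => a != b }.foreach(println)
        }
      }
    } finally {
      sc.stop()
    }
  }
}
```

If the two settings disagree only when there are 2 partitions, that would point at how vertex attributes are shipped to the edge partitions rather than at KMeans.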