GitHub user aray opened a pull request:
https://github.com/apache/spark/pull/16271
[SPARK-18845][GraphX] PageRank has incorrect initialization value that
leads to slow convergence
## What changes were proposed in this pull request?
Change the initial value in all PageRank implementations to be `1.0`
instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of
`joinVertices` so that source vertices get updated in each iteration.
This seems to have been introduced a long time ago in
https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
With the exception of graphs with sinks (which currently give incorrect
results see SPARK-18847) this gives faster convergence as the sum of ranks is
already correct (sum or ranks should be number of vertices).
Convergence comparision benchmark for small graph: http://imgur.com/a/HkkZf
Code for benchmark:
https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd
## How was this patch tested?
(corrected) existing unit tests and additional test that verifies against
result of igraph and NetworkX on a loop with a source.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aray/spark pagerank-initial-value
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16271.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16271
----
commit 9149ca233722e3d70aefb223dfd6a16ee8dbf924
Author: Andrew Ray <[email protected]>
Date: 2016-12-12T18:34:49Z
fix
commit b145376d88b6f5e58e2b9d051d9c268a36b9f939
Author: Andrew Ray <[email protected]>
Date: 2016-12-13T15:51:06Z
fix initial value for grid graph independent calculation
commit d39d2f07ab1a1aadb24dbd67bbbe37400beaadb4
Author: Andrew Ray <[email protected]>
Date: 2016-12-13T16:25:15Z
use outer join so that sources are updated and fix reset probability for
personalized
commit 7ea03a88a3d9caa0ab7a7e6e681b8bf00b5cc128
Author: Andrew Ray <[email protected]>
Date: 2016-12-13T16:36:10Z
fix star page rank test to account for sources getting updated in the first
iteration which then changes the center in the second iteration
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]