GitHub user aray opened a pull request:

    https://github.com/apache/spark/pull/16271

    [SPARK-18845][GraphX] PageRank has incorrect initialization value that 
leads to slow convergence

    ## What changes were proposed in this pull request?
    
    Change the initial value in all PageRank implementations to be `1.0` 
instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of 
`joinVertices` so that source vertices get updated in each iteration. 
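
A rough, dependency-free Python sketch of why the join type matters for source vertices. It is not the GraphX code: `pagerank_step`, the `outer_join` flag, and the three-vertex graph are hypothetical names and data chosen for illustration. It assumes a vertex with no incoming edges receives no message, so an inner join (as with `joinVertices`) leaves its rank untouched, while an outer join (as with `outerJoinVertices`) treats the missing message as zero.

```python
def pagerank_step(ranks, edges, reset_prob=0.15, outer_join=True):
    """One unnormalized PageRank update over a dict of vertex -> rank."""
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1

    # Sum the contributions arriving at each destination vertex.
    msg_sum = {}
    for src, dst in edges:
        msg_sum[dst] = msg_sum.get(dst, 0.0) + ranks[src] / out_degree[src]

    new_ranks = {}
    for vertex, old_rank in ranks.items():
        if vertex in msg_sum:
            new_ranks[vertex] = reset_prob + (1.0 - reset_prob) * msg_sum[vertex]
        elif outer_join:
            # Outer join: a missing message counts as 0.0.
            new_ranks[vertex] = reset_prob
        else:
            # Inner join: vertices without a message keep their old rank.
            new_ranks[vertex] = old_rank
    return new_ranks


# Hypothetical graph: vertex 0 is a source feeding the loop 1 <-> 2.
edges = [(0, 1), (1, 2), (2, 1)]
initial = {0: 1.0, 1: 1.0, 2: 1.0}

after_inner = pagerank_step(initial, edges, outer_join=False)
after_outer = pagerank_step(initial, edges, outer_join=True)
```

Under these assumptions the source vertex stays at its initial `1.0` with the inner join and drops to `resetProb` with the outer join; the vertices inside the loop get the same update either way.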
    
    The incorrect initial value seems to have been introduced a long time ago in 
https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
    
    With the exception of graphs with sinks (which currently give incorrect 
results, see SPARK-18847), this gives faster convergence because the sum of 
ranks is already correct (the sum of ranks should equal the number of vertices).
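
The sum-of-ranks argument can be checked with a small standalone Python sketch. It is an illustration under stated assumptions, not the benchmark linked in this description: `rank_sums` and the four-vertex graph are hypothetical, the update is the unnormalized form `resetProb + (1 - resetProb) * incoming`, and the graph has no sinks.

```python
def rank_sums(edges, num_vertices, initial_value, num_iter, reset_prob=0.15):
    """Return the total rank after each of num_iter PageRank iterations."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [initial_value] * num_vertices
    sums = []
    for _ in range(num_iter):
        # Each vertex splits its rank evenly across its out-edges.
        incoming = [0.0] * num_vertices
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = [reset_prob + (1.0 - reset_prob) * m for m in incoming]
        sums.append(sum(ranks))
    return sums


# Hypothetical 4-vertex graph in which every vertex has an out-edge (no sinks).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

from_one = rank_sums(edges, 4, initial_value=1.0, num_iter=5)
from_reset_prob = rank_sums(edges, 4, initial_value=0.15, num_iter=5)
```

With every vertex starting at `1.0` the total stays at the number of vertices (4, up to rounding) in every iteration. Starting at `0.15`, the total is only 1.11 after the first iteration and closes the remaining gap by a factor of `1 - resetProb` per step, so the iteration spends its early steps rescaling the total rather than refining the relative ranks.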
    
    Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
    Code for benchmark: 
https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd
    
    ## How was this patch tested?
    
    Corrected the existing unit tests and added a test that verifies the 
results against those of igraph and NetworkX on a loop with a source.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aray/spark pagerank-initial-value

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16271.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16271
    
----
commit 9149ca233722e3d70aefb223dfd6a16ee8dbf924
Author: Andrew Ray <[email protected]>
Date:   2016-12-12T18:34:49Z

    fix

commit b145376d88b6f5e58e2b9d051d9c268a36b9f939
Author: Andrew Ray <[email protected]>
Date:   2016-12-13T15:51:06Z

    fix initial value for grid graph independent calculation

commit d39d2f07ab1a1aadb24dbd67bbbe37400beaadb4
Author: Andrew Ray <[email protected]>
Date:   2016-12-13T16:25:15Z

    use outer join so that sources are updated and fix reset probability for 
personalized

commit 7ea03a88a3d9caa0ab7a7e6e681b8bf00b5cc128
Author: Andrew Ray <[email protected]>
Date:   2016-12-13T16:36:10Z

    fix star page rank test to account for sources getting updated in the first 
iteration which then changes the center in the second iteration

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
