The implementation implicitly deals with dangling nodes by computing the pageRank as

## Advertising

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks Conceptually, this is the same as connecting every vertex with every other vertex it is not yet connected to. This removes all dangling nodes from the graph as every vertex can reach every other vertex. "Mining of Massive Datasets" [1] has a very well written explanation of this approach. --sebastian [1] http://infolab.stanford.edu/~ullman/mmds.html On 17.05.2012 16:23, Paolo Castagna wrote: > Hi, > I noticed that the SimplePageRankVertex implementation does not deal with > dangling nodes (dangling nodes are those nodes with no outgoing links). > And, probably that is one of the good reasons why there is a "Simple" in > front. ;-) > > "Dangling nodes exist in many forms. For example, a page of data, a page > with a > postscript graph, [...], a page that has been fetched by a crawler but not yet > explored - these are all examples of possible dangling nodes. The more > ambitious > the crawl, the bigger the proportion of dangling nodes because the set of > fetched but uncrawled pages grows quickly." (pag. 80) [1] > > Wikipedia's PageRank page says: > > "If a page has no links to other pages, it becomes a sink and therefore > terminates the random surfing process. If the random surfer arrives at a sink > page, it picks another URL at random and continues surfing again. When > calculating PageRank, pages with no outbound links are assumed to link out to > all other pages in the collection. Their PageRank scores are therefore divided > evenly among all other pages. In other words, to be fair with pages that are > not > sinks, these random transitions are added to all nodes in the Web, with a > residual probability usually set to d = 0.85, estimated from the frequency > that > an average surfer uses his or her browser's bookmark feature." [2] > > Now, if I want to try to account for the dangling nodes I need to sum all > PageRank values for the dangling nodes and divide it evenly among all other > pages (which in practce means to send a message to all vertexes in Giraph... > which is not possible, as it would be crazy with large graphs). > > I imagine one can use a SumAggregator and alternate computing contribution > from > dangling nodes and PageRank values (i.e. PageRank values are computed only > every > 2 supersteps). > > What do you think? > > Paolo > > [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The > Science of Search Engine Rankings" (2006) > [2] http://en.wikipedia.org/wiki/PageRank