Hi Sebastian,

from SimplePageRankVertex.java, we have:

    if (getSuperstep() < MAX_SUPERSTEPS) {
      long edges = getNumOutEdges();
      sendMsgToAllEdges(
          new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
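(A quick standalone check, outside Giraph, suggests the division itself won't
throw: in Java a double divided by a long zero yields Infinity rather than an
ArithmeticException. And with zero out-edges, sendMsgToAllEdges presumably has
no recipients, so the dangling vertex's rank is never passed on to anyone.
The class below is my own illustration, not from the Giraph sources:)

    // Standalone illustration, not part of SimplePageRankVertex.java.
    public class ZeroOutEdges {
      public static void main(String[] args) {
        double rank = 0.15;
        long edges = 0;                    // a dangling vertex has no out-edges
        System.out.println(rank / edges);  // prints "Infinity", no exception
      }
    }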
Paolo

Sebastian Schelter wrote:
> The implementation implicitly deals with dangling nodes by computing the
> pageRank as
>
>   0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>
> Conceptually, this is the same as connecting every vertex with every other
> vertex it is not yet connected to. This removes all dangling nodes from the
> graph, as every vertex can reach every other vertex.
>
> "Mining of Massive Datasets" [1] has a very well written explanation of
> this approach.
>
> --sebastian
>
> [1] http://infolab.stanford.edu/~ullman/mmds.html
>
> On 17.05.2012 16:23, Paolo Castagna wrote:
>> Hi,
>> I noticed that the SimplePageRankVertex implementation does not deal with
>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>> And probably that is one of the good reasons why there is a "Simple" in
>> front. ;-)
>>
>> "Dangling nodes exist in many forms. For example, a page of data, a page
>> with a postscript graph, [...], a page that has been fetched by a crawler
>> but not yet explored - these are all examples of possible dangling nodes.
>> The more ambitious the crawl, the bigger the proportion of dangling nodes
>> because the set of fetched but uncrawled pages grows quickly." (p. 80) [1]
>>
>> Wikipedia's PageRank page says:
>>
>> "If a page has no links to other pages, it becomes a sink and therefore
>> terminates the random surfing process. If the random surfer arrives at a
>> sink page, it picks another URL at random and continues surfing again.
>> When calculating PageRank, pages with no outbound links are assumed to
>> link out to all other pages in the collection. Their PageRank scores are
>> therefore divided evenly among all other pages. In other words, to be
>> fair with pages that are not sinks, these random transitions are added to
>> all nodes in the Web, with a residual probability usually set to d = 0.85,
>> estimated from the frequency that an average surfer uses his or her
>> browser's bookmark feature." [2]
>>
>> Now, if I want to account for dangling nodes, I need to sum the PageRank
>> values of all dangling nodes and divide the total evenly among all other
>> pages. In practice that would mean sending a message to every vertex in
>> Giraph, which is not feasible for large graphs.
>>
>> I imagine one can use a SumAggregator and alternate between computing the
>> contribution from dangling nodes and computing the PageRank values (i.e.
>> PageRank values are computed only every 2 supersteps).
>>
>> What do you think?
>>
>> Paolo
>>
>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond:
>>     The Science of Search Engine Rankings" (2006)
>> [2] http://en.wikipedia.org/wiki/PageRank
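P.S. To sanity-check the SumAggregator idea from my original mail (quoted
above), here is a minimal single-machine simulation. This is plain Java, not
Giraph code; the 4-vertex graph, the class name and the iteration count are
all made up for illustration. It folds the two supersteps into one loop
iteration: first sum the rank sitting on dangling vertices (the aggregator
phase), then run the usual update with an even share of that mass added in.
The update line mirrors Sebastian's 0.15f / getNumVertices() + 0.85f *
sumOfNeighborRanks, with the dangling mass folded into the sum:

    import java.util.Arrays;

    // Single-machine sketch (not Giraph code) of PageRank with the
    // dangling-node mass redistributed evenly across all vertices.
    public class DanglingPageRankSketch {
      public static void main(String[] args) {
        // Hypothetical 4-vertex graph; vertex 3 is dangling (no out-links).
        int[][] out = { {1, 2}, {2}, {0}, {} };
        int n = out.length;
        double d = 0.85;                      // damping factor
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 30; iter++) {
          // Phase 1 (the SumAggregator role): total rank on dangling vertices.
          double danglingSum = 0;
          for (int v = 0; v < n; v++) {
            if (out[v].length == 0) danglingSum += rank[v];
          }
          // Phase 2: the usual PageRank update, plus an even share of the
          // dangling mass for every vertex.
          double[] next = new double[n];
          for (int v = 0; v < n; v++) {
            for (int w : out[v]) next[w] += rank[v] / out[v].length;
          }
          for (int v = 0; v < n; v++) {
            next[v] = (1 - d) / n + d * (next[v] + danglingSum / n);
          }
          rank = next;
        }
        // With the dangling mass redistributed, the ranks sum to ~1.0.
        System.out.println(Arrays.toString(rank));
      }
    }

In Giraph terms, phase 1 would be a superstep in which every dangling vertex
aggregates its current value into a SumAggregator, and phase 2 would read the
aggregated value back, which is why the PageRank values would only advance
every 2 supersteps.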