You are right, the code would throw an exception there, but I guess it would be sufficient to simply add a check and not send any messages in that case.
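Something along these lines should do (a rough, untested sketch based on the snippet you quoted below):

    if (getSuperstep() < MAX_SUPERSTEPS) {
      long edges = getNumOutEdges();
      // dangling vertex: nothing to send, just keep the current value
      if (edges > 0) {
        sendMsgToAllEdges(
            new DoubleWritable(getVertexValue().get() / edges));
      }
    } else {
      voteToHalt();
    }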
--sebastian

On 17.05.2012 23:44, Paolo Castagna wrote:
> Hi Sebastian,
> from SimplePageRankVertex.java, we have:
>
> if (getSuperstep() < MAX_SUPERSTEPS) {
>   long edges = getNumOutEdges();
>   sendMsgToAllEdges(
>       new DoubleWritable(getVertexValue().get() / edges));
> } else {
>   voteToHalt();
> }
>
> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>
> Paolo
>
> Sebastian Schelter wrote:
>> The implementation implicitly deals with dangling nodes by computing the
>> PageRank as
>>
>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>
>> Conceptually, this is the same as connecting every vertex with every
>> other vertex it is not yet connected to. This removes all dangling nodes
>> from the graph, as every vertex can reach every other vertex.
>>
>> "Mining of Massive Datasets" [1] has a very well written explanation of
>> this approach.
>>
>> --sebastian
>>
>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>
>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>> Hi,
>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>> And that is probably one of the good reasons why there is a "Simple" in
>>> front. ;-)
>>>
>>> "Dangling nodes exist in many forms. For example, a page of data, a page
>>> with a postscript graph, [...], a page that has been fetched by a crawler
>>> but not yet explored - these are all examples of possible dangling nodes.
>>> The more ambitious the crawl, the bigger the proportion of dangling nodes
>>> because the set of fetched but uncrawled pages grows quickly." (p. 80) [1]
>>>
>>> Wikipedia's PageRank page says:
>>>
>>> "If a page has no links to other pages, it becomes a sink and therefore
>>> terminates the random surfing process. If the random surfer arrives at a
>>> sink page, it picks another URL at random and continues surfing again.
>>> When calculating PageRank, pages with no outbound links are assumed to
>>> link out to all other pages in the collection. Their PageRank scores are
>>> therefore divided evenly among all other pages. In other words, to be
>>> fair with pages that are not sinks, these random transitions are added to
>>> all nodes in the Web, with a residual probability usually set to d = 0.85,
>>> estimated from the frequency that an average surfer uses his or her
>>> browser's bookmark feature." [2]
>>>
>>> Now, if I want to account for the dangling nodes, I need to sum all the
>>> PageRank values of the dangling nodes and divide that sum evenly among all
>>> other pages (which in practice means sending a message to all vertices in
>>> Giraph... which is not possible, as it would be crazy with large graphs).
>>>
>>> I imagine one can use a SumAggregator and alternate between computing the
>>> contribution from dangling nodes and the PageRank values (i.e. PageRank
>>> values are computed only every 2 supersteps).
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>> Science of Search Engine Rankings" (2006)
>>> [2] http://en.wikipedia.org/wiki/PageRank
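Regarding the SumAggregator idea Paolo describes above, a rough sketch of the alternating-superstep scheme could look like the following. It is completely untested, and aggregateDanglingMass() / getDanglingMass() are just placeholders for whatever calls the registered sum aggregator exposes in the Giraph version at hand, so treat every name here as an assumption rather than the actual API:

    // Even supersteps: distribute rank. Dangling vertices put their whole
    // rank into a shared sum aggregator instead of sending messages.
    // Odd supersteps: fold the received contributions plus the evenly
    // split dangling mass into the new PageRank value.
    public void compute(Iterator<DoubleWritable> msgIterator) {
      if (getSuperstep() < 2 * MAX_SUPERSTEPS) {
        if (getSuperstep() % 2 == 0) {
          long edges = getNumOutEdges();
          if (edges == 0) {
            // placeholder for aggregating into a SumAggregator
            aggregateDanglingMass(getVertexValue().get());
          } else {
            sendMsgToAllEdges(
                new DoubleWritable(getVertexValue().get() / edges));
          }
        } else {
          // placeholder for reading the aggregated dangling mass of the
          // previous superstep, split evenly over all vertices
          double sum = getDanglingMass() / getNumVertices();
          while (msgIterator.hasNext()) {
            sum += msgIterator.next().get();
          }
          // assuming a setVertexValue(...) setter as in the Vertex API
          setVertexValue(new DoubleWritable(
              0.15f / getNumVertices() + 0.85f * sum));
        }
      } else {
        voteToHalt();
      }
    }

That way no vertex ever has to message all other vertices, the dangling mass only travels through the aggregator, and, as Paolo says, the rank is only updated every second superstep (hence the 2 * MAX_SUPERSTEPS bound).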