Could you just accumulate the PR of dangling nodes in a variable and divide it equally in the next iteration? If you use a preference vector instead of dividing it equally, then you have also the strongly preferential (and weakly preferential in case you divide equally) versions of RWR.

Cheers, -- Gianmarco On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna < castagna.li...@googlemail.com> wrote: > Sebastian Schelter wrote: > > You are right, the code would throw an exception here, but I guess it > > would be sufficient to simply add a check here and not send any messages. > > That would avoid throwing an exception, but I am not sure it is actually > the best thing to do. > Some suggest to divide the PageRank scores of dangling nodes evenly among > all other pages. > If you want, this is equivalent to another random jump, but forced by the > fact that there are no outgoing links to select from. > Using your notation below: > > 0.15f / getNumVertices() + 0.85f ( sumOfNeighborRanks + > sumOfDanlingNodesRanks / getNumVertices() ) > > See also pag. 38 of the book ""Google's PageRank and Beyond": > http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38 > > In other works, in pseudo code a serial implementation which accounts for > dangling nodes should be: > > -------- > dangling_nodes_ranks = 0 > foreach node in dangling_node > dangling_nodes_ranks += ranks.get ( node ) > dangling_nodes_ranks = ( dumping_factor * dangling_nodes_ranks ) / N ; > > foeach node1 in graph > r = 0 > foreach node2 in incoming_links ( node1 ) > r += ranks.get ( node2 ) / number_outgoing_links ( node2 ) > > r = dumping_factor * r + dangling_nodes_ranks + ( 1 - dumping_factor ) / N > -------- > > What do you think? > > Paolo > > > > > --sebastian > > > > > > On 17.05.2012 23:44, Paolo Castagna wrote: > >> Hi Sebastian, > >> from SimplePageRankVertex.java, we have: > >> > >> if (getSuperstep() < MAX_SUPERSTEPS) { > >> long edges = getNumOutEdges(); > >> sendMsgToAllEdges( > >> new DoubleWritable(getVertexValue().get() / edges)); > >> } else { > >> voteToHalt(); > >> } > >> > >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)? > >> > >> Paolo > >> > >> Sebastian Schelter wrote: > >>> The implementation implicitly deals with dangling nodes by computing > the > >>> pageRank as > >>> > >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks > >>> > >>> Conceptually, this is the same as connecting every vertex with every > >>> other vertex it is not yet connected to. This removes all dangling > nodes > >>> from the graph as every vertex can reach every other vertex. > >>> > >>> "Mining of Massive Datasets" [1] has a very well written explanation of > >>> this approach. > >>> > >>> --sebastian > >>> > >>> > >>> [1] http://infolab.stanford.edu/~ullman/mmds.html > >>> > >>> On 17.05.2012 16:23, Paolo Castagna wrote: > >>>> Hi, > >>>> I noticed that the SimplePageRankVertex implementation does not deal > with > >>>> dangling nodes (dangling nodes are those nodes with no outgoing > links). > >>>> And, probably that is one of the good reasons why there is a "Simple" > in front. ;-) > >>>> > >>>> "Dangling nodes exist in many forms. For example, a page of data, a > page with a > >>>> postscript graph, [...], a page that has been fetched by a crawler > but not yet > >>>> explored - these are all examples of possible dangling nodes. The > more ambitious > >>>> the crawl, the bigger the proportion of dangling nodes because the > set of > >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1] > >>>> > >>>> Wikipedia's PageRank page says: > >>>> > >>>> "If a page has no links to other pages, it becomes a sink and > therefore > >>>> terminates the random surfing process. If the random surfer arrives > at a sink > >>>> page, it picks another URL at random and continues surfing again. When > >>>> calculating PageRank, pages with no outbound links are assumed to > link out to > >>>> all other pages in the collection. Their PageRank scores are > therefore divided > >>>> evenly among all other pages. In other words, to be fair with pages > that are not > >>>> sinks, these random transitions are added to all nodes in the Web, > with a > >>>> residual probability usually set to d = 0.85, estimated from the > frequency that > >>>> an average surfer uses his or her browser's bookmark feature." [2] > >>>> > >>>> Now, if I want to try to account for the dangling nodes I need to sum > all > >>>> PageRank values for the dangling nodes and divide it evenly among all > other > >>>> pages (which in practce means to send a message to all vertexes in > Giraph... > >>>> which is not possible, as it would be crazy with large graphs). > >>>> > >>>> I imagine one can use a SumAggregator and alternate computing > contribution from > >>>> dangling nodes and PageRank values (i.e. PageRank values are computed > only every > >>>> 2 supersteps). > >>>> > >>>> What do you think? > >>>> > >>>> Paolo > >>>> > >>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and > Beyond: The > >>>> Science of Search Engine Rankings" (2006) > >>>> [2] http://en.wikipedia.org/wiki/PageRank > > >