Accumulation should be possible with aggregators, and the idea of applying preference vectors seems very interesting too.
Maybe these should be implemented as additional features in GIRAPH-191?
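To make the aggregator idea concrete, here is a rough sketch of what a compute() method could look like, adapted from the SimplePageRankVertex code quoted below. The aggregator name ("dangling.mass") and the aggregate() / getAggregatedValue() calls are only assumptions (the aggregator API and the registration boilerplate differ between Giraph versions), and the dangling mass aggregated in one superstep only becomes visible in the next one, which gives exactly the one-iteration delay Gianmarco describes:

--------
public void compute(Iterator<DoubleWritable> msgIterator) {
  if (getSuperstep() >= 1) {
    double sumOfNeighborRanks = 0;
    while (msgIterator.hasNext()) {
      sumOfNeighborRanks += msgIterator.next().get();
    }
    // Dangling mass collected during the previous superstep,
    // shared equally among all vertices (hypothetical aggregator).
    DoubleWritable danglingMass = getAggregatedValue("dangling.mass");
    double rank = 0.15f / getNumVertices()
        + 0.85f * (sumOfNeighborRanks + danglingMass.get() / getNumVertices());
    setVertexValue(new DoubleWritable(rank));
  }
  if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    if (edges > 0) {
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      // Dangling node: add the rank to the shared sum instead of
      // sending messages (hypothetical aggregator call).
      aggregate("dangling.mass", getVertexValue());
    }
  } else {
    voteToHalt();
  }
}
--------

With a preference vector, the dangling mass (and the random jump) would be split according to each vertex's preference weight instead of 1 / getNumVertices().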
On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and divide
> it equally in the next iteration?
> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential in case you
> divide equally) versions of RWR.
>
> Cheers,
> --
> Gianmarco
>
> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
> <castagna.li...@googlemail.com> wrote:
>> Sebastian Schelter wrote:
>>> You are right, the code would throw an exception here, but I guess it
>>> would be sufficient to simply add a check here and not send any messages.
>>
>> That would avoid throwing an exception, but I am not sure it is actually
>> the best thing to do.
>> Some suggest to divide the PageRank scores of dangling nodes evenly among
>> all other pages.
>> If you want, this is equivalent to another random jump, but forced by the
>> fact that there are no outgoing links to select from.
>> Using your notation below:
>>
>> 0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>> sumOfDanglingNodesRanks / getNumVertices() )
>>
>> See also p. 38 of the book "Google's PageRank and Beyond":
>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>
>> In other words, in pseudo code a serial implementation which accounts for
>> dangling nodes should be:
>>
>> --------
>> dangling_nodes_ranks = 0
>> foreach node in dangling_nodes
>>     dangling_nodes_ranks += ranks.get ( node )
>> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>>
>> foreach node1 in graph
>>     r = 0
>>     foreach node2 in incoming_links ( node1 )
>>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>     r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N
>>     new_ranks.put ( node1, r )
>> --------
>>
>> What do you think?
>>
>> Paolo
>>
>>> --sebastian
>>>
>>> On 17.05.2012 23:44, Paolo Castagna wrote:
>>>> Hi Sebastian,
>>>> from SimplePageRankVertex.java, we have:
>>>>
>>>> if (getSuperstep() < MAX_SUPERSTEPS) {
>>>>   long edges = getNumOutEdges();
>>>>   sendMsgToAllEdges(
>>>>       new DoubleWritable(getVertexValue().get() / edges));
>>>> } else {
>>>>   voteToHalt();
>>>> }
>>>>
>>>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>>>
>>>> Paolo
>>>>
>>>> Sebastian Schelter wrote:
>>>>> The implementation implicitly deals with dangling nodes by computing the
>>>>> pageRank as
>>>>>
>>>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>>>
>>>>> Conceptually, this is the same as connecting every vertex with every
>>>>> other vertex it is not yet connected to. This removes all dangling nodes
>>>>> from the graph as every vertex can reach every other vertex.
>>>>>
>>>>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>>>> this approach.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>>>
>>>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>>>> Hi,
>>>>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>>>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>>>>> And, probably that is one of the good reasons why there is a "Simple"
>>>>>> in front. ;-)
>>>>>>
>>>>>> "Dangling nodes exist in many forms. For example, a page of data, a page with a
>>>>>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>>>>>> explored - these are all examples of possible dangling nodes. The more ambitious
>>>>>> the crawl, the bigger the proportion of dangling nodes because the set of
>>>>>> fetched but uncrawled pages grows quickly." (p. 80) [1]
>>>>>>
>>>>>> Wikipedia's PageRank page says:
>>>>>>
>>>>>> "If a page has no links to other pages, it becomes a sink and therefore
>>>>>> terminates the random surfing process. If the random surfer arrives at a sink
>>>>>> page, it picks another URL at random and continues surfing again. When
>>>>>> calculating PageRank, pages with no outbound links are assumed to link out to
>>>>>> all other pages in the collection. Their PageRank scores are therefore divided
>>>>>> evenly among all other pages. In other words, to be fair with pages that are not
>>>>>> sinks, these random transitions are added to all nodes in the Web, with a
>>>>>> residual probability usually set to d = 0.85, estimated from the frequency that
>>>>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>>>>
>>>>>> Now, if I want to try to account for the dangling nodes, I need to sum all
>>>>>> PageRank values of the dangling nodes and divide the sum evenly among all other
>>>>>> pages (which in practice means sending a message to all vertexes in Giraph...
>>>>>> which is not possible, as it would be crazy with large graphs).
>>>>>>
>>>>>> I imagine one can use a SumAggregator and alternate computing the contribution
>>>>>> from dangling nodes and the PageRank values (i.e. PageRank values are computed
>>>>>> only every 2 supersteps).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>>>>> Science of Search Engine Rankings" (2006)
>>>>>> [2] http://en.wikipedia.org/wiki/PageRank
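For reference, below is a small self-contained serial Java version of the pseudo code from Paolo's mail above. It is only a sketch (class and variable names are made up), and it pushes rank along outgoing links instead of pulling along incoming ones, which gives the same result:

--------
import java.util.*;

/** One-file sketch: PageRank iterations that redistribute the rank of
 *  dangling nodes evenly among all nodes. */
public class DanglingPageRank {

  static final double DAMPING = 0.85;

  /** graph: node -> outgoing links; ranks: node -> current rank. */
  static Map<String, Double> iterate(Map<String, List<String>> graph,
                                     Map<String, Double> ranks) {
    int n = graph.size();

    // Sum of the ranks of dangling nodes (no outgoing links),
    // spread evenly: the "dangling_nodes_ranks" term above.
    double danglingSum = 0.0;
    for (Map.Entry<String, List<String>> e : graph.entrySet()) {
      if (e.getValue().isEmpty()) {
        danglingSum += ranks.get(e.getKey());
      }
    }
    double danglingShare = DAMPING * danglingSum / n;

    // Start every node with the random jump and the dangling share,
    // then add the contributions that flow along the links.
    Map<String, Double> next = new HashMap<>();
    for (String node : graph.keySet()) {
      next.put(node, danglingShare + (1 - DAMPING) / n);
    }
    for (Map.Entry<String, List<String>> e : graph.entrySet()) {
      List<String> out = e.getValue();
      if (out.isEmpty()) {
        continue;
      }
      double share = DAMPING * ranks.get(e.getKey()) / out.size();
      for (String target : out) {
        next.merge(target, share, Double::sum);
      }
    }
    return next;
  }

  public static void main(String[] args) {
    // Tiny example: C is a dangling node.
    Map<String, List<String>> graph = new HashMap<>();
    graph.put("A", Arrays.asList("B", "C"));
    graph.put("B", Arrays.asList("C"));
    graph.put("C", Collections.<String>emptyList());

    Map<String, Double> ranks = new HashMap<>();
    for (String node : graph.keySet()) {
      ranks.put(node, 1.0 / graph.size());
    }
    for (int i = 0; i < 30; i++) {
      ranks = iterate(graph, ranks);
    }
    System.out.println(ranks); // the three ranks sum to ~1.0
  }
}
--------

Replacing the even danglingSum / n split (and the random jump term) with a per-node preference vector gives the strongly preferential random walk with restart Gianmarco mentions; keeping the even split is the weakly preferential variant.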