Hi Paolo,

Very good points, could you add them to GIRAPH-191?

I think that the convergence check does not need an extra superstep. It is
sufficient to compute the L1-norm of the current PageRank vector, which can
be done by an aggregator, and compare it to the previous aggregated value.
I guess that summing up and redistributing the PageRank of dangling
vertices can also be done without an extra superstep, in an aggregator.
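
Something like the following rough sketch for the dangling vertices, using
the method names that appear further down in this thread (getSuperstep(),
getNumVertices(), getNumOutEdges(), getVertexValue(), sendMsgToAllEdges(),
voteToHalt(), MAX_SUPERSTEPS as in SimplePageRankVertex); the helpers
aggregateDanglingRank() and getDanglingRankSum() are placeholders for a
registered DoubleSumAggregator, whose exact API differs between Giraph
versions:

--------
// Sketch only: dangling-node-aware PageRank without an extra superstep.
// Dangling vertices publish their rank through a sum aggregator instead
// of sending messages; every vertex reads the value that was aggregated
// during the previous superstep.
public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
        double sumOfNeighborRanks = 0;
        while (msgIterator.hasNext()) {
            sumOfNeighborRanks += msgIterator.next().get();
        }
        // getDanglingRankSum(): assumed helper returning the total rank
        // of dangling vertices aggregated in the previous superstep.
        setVertexValue(new DoubleWritable(
            0.15f / getNumVertices()
            + 0.85f * (sumOfNeighborRanks
                       + getDanglingRankSum() / getNumVertices())));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
        if (getNumOutEdges() > 0) {
            sendMsgToAllEdges(new DoubleWritable(
                getVertexValue().get() / getNumOutEdges()));
        } else {
            // Assumed helper: contribute this dangling vertex's rank to
            // the aggregator, for redistribution in the next superstep.
            aggregateDanglingRank(getVertexValue().get());
        }
    } else {
        voteToHalt();
    }
}
--------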
--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
> Hi Gianmarco
>
> Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and
>> divide it equally in the next iteration?
>
> Yep, this is exactly what I was thinking when I wrote: "I imagine one can
> use a SumAggregator and alternate computing the contribution from
> dangling nodes and the PageRank values (i.e. PageRank values are computed
> only every 2 supersteps)."
>
> You need to use an Aggregator and spend a superstep just to aggregate/sum
> the PageRank of all dangling nodes.
> Then, in the successive superstep, you can use that value divided by the
> number of nodes in the graph to distribute it equally among all other
> nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
>
> Another change/enhancement could be to check for convergence (to do
> that, you need to keep current and previous PageRank values) and spend
> another superstep just doing that.
> But, as you can see, things stop being simple, and the point of
> SimplePageRankVertex is to offer a very simple example.
> I did not start this thread with the aim to change SimplePageRankVertex;
> I think it should stay as it is (and as Google's Pregel paper describes
> it).
>
> However, in practice, if you want to run PageRank or similar, you need
> to deal with problems such as: what to do with dangling nodes/pages? How
> do you know how many iterations you need to reach the precision you
> want/need? Etc.
>
> My initial message was to exchange some ideas and get some feedback
> and/or suggestions on how a not-too-simple PageRankVertex might look.
> ;-)
>
>> If you use a preference vector instead of dividing it equally, then you
>> also have the strongly preferential (and weakly preferential, in case
>> you divide equally) versions of RWR.
>
> Yep, but how do you get your preference vector? ;-)
>
> As a first step, it would be great if we could come up with a
> PageRankVertex which takes into account dangling nodes and
> perhaps/optionally checks for convergence every few supersteps.
>
> Anyway, thanks for your comments and feedback Gianmarco.
>
> Paolo
>
>> Cheers,
>> --
>> Gianmarco
>>
>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>> <castagna.li...@googlemail.com> wrote:
>>
>> Sebastian Schelter wrote:
>> > You are right, the code would throw an exception here, but I guess it
>> > would be sufficient to simply add a check here and not send any
>> > messages.
>>
>> That would avoid throwing an exception, but I am not sure it is
>> actually the best thing to do.
>> Some suggest dividing the PageRank scores of dangling nodes evenly
>> among all other pages.
>> If you want, this is equivalent to another random jump, but one forced
>> by the fact that there are no outgoing links to select from.
>> Using your notation below:
>>
>> 0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>> sumOfDanglingNodesRanks / getNumVertices() )
>>
>> See also page 38 of the book "Google's PageRank and Beyond":
>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>
>> In other words, in pseudocode, a serial implementation which accounts
>> for dangling nodes would be:
>>
>> --------
>> dangling_nodes_ranks = 0
>> foreach node in dangling_nodes
>>     dangling_nodes_ranks += ranks.get ( node )
>> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>>
>> foreach node1 in graph
>>     r = 0
>>     foreach node2 in incoming_links ( node1 )
>>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>
>>     r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N
>> --------
>>
>> What do you think?
>>
>> Paolo
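
(For reference, the pseudocode above might translate into plain serial
Java roughly as follows; the graph representation and the helper maps
incomingLinks and numberOutgoingLinks are assumptions for the sketch, not
Giraph API.)

--------
// Serial reference sketch of the pseudocode above, one PageRank
// iteration over java.util collections. Assumes incomingLinks maps every
// node to its (possibly empty) list of in-neighbours.
static Map<Long, Double> iterate(Map<Long, Double> ranks,
                                 Set<Long> danglingNodes,
                                 Map<Long, List<Long>> incomingLinks,
                                 Map<Long, Integer> numberOutgoingLinks,
                                 double dampingFactor) {
    int n = ranks.size();

    // Rank mass sitting on dangling nodes, damped and spread evenly.
    double danglingRank = 0;
    for (long node : danglingNodes) {
        danglingRank += ranks.get(node);
    }
    danglingRank = (dampingFactor * danglingRank) / n;

    Map<Long, Double> newRanks = new HashMap<Long, Double>();
    for (long node1 : ranks.keySet()) {
        double r = 0;
        for (long node2 : incomingLinks.get(node1)) {
            r += ranks.get(node2) / numberOutgoingLinks.get(node2);
        }
        newRanks.put(node1,
            dampingFactor * r + danglingRank + (1 - dampingFactor) / n);
    }
    return newRanks;
}
--------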
>> >
>> > --sebastian
>> >
>> > On 17.05.2012 23:44, Paolo Castagna wrote:
>> >> Hi Sebastian,
>> >> from SimplePageRankVertex.java, we have:
>> >>
>> >> if (getSuperstep() < MAX_SUPERSTEPS) {
>> >>     long edges = getNumOutEdges();
>> >>     sendMsgToAllEdges(
>> >>         new DoubleWritable(getVertexValue().get() / edges));
>> >> } else {
>> >>     voteToHalt();
>> >> }
>> >>
>> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>> >> node)?
>> >>
>> >> Paolo
>> >>
>> >> Sebastian Schelter wrote:
>> >>> The implementation implicitly deals with dangling nodes by
>> >>> computing the PageRank as
>> >>>
>> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>> >>>
>> >>> Conceptually, this is the same as connecting every vertex with
>> >>> every other vertex it is not yet connected to. This removes all
>> >>> dangling nodes from the graph, as every vertex can reach every
>> >>> other vertex.
>> >>>
>> >>> "Mining of Massive Datasets" [1] has a very well written
>> >>> explanation of this approach.
>> >>>
>> >>> --sebastian
>> >>>
>> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>> >>>
>> >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>> >>>> Hi,
>> >>>> I noticed that the SimplePageRankVertex implementation does not
>> >>>> deal with dangling nodes (dangling nodes are those nodes with no
>> >>>> outgoing links). And probably that is one of the good reasons why
>> >>>> there is a "Simple" in front. ;-)
>> >>>>
>> >>>> "Dangling nodes exist in many forms. For example, a page of data,
>> >>>> a page with a postscript graph, [...], a page that has been
>> >>>> fetched by a crawler but not yet explored - these are all examples
>> >>>> of possible dangling nodes. The more ambitious the crawl, the
>> >>>> bigger the proportion of dangling nodes because the set of fetched
>> >>>> but uncrawled pages grows quickly." (page 80) [1]
>> >>>>
>> >>>> Wikipedia's PageRank page says:
>> >>>>
>> >>>> "If a page has no links to other pages, it becomes a sink and
>> >>>> therefore terminates the random surfing process. If the random
>> >>>> surfer arrives at a sink page, it picks another URL at random and
>> >>>> continues surfing again. When calculating PageRank, pages with no
>> >>>> outbound links are assumed to link out to all other pages in the
>> >>>> collection. Their PageRank scores are therefore divided evenly
>> >>>> among all other pages. In other words, to be fair with pages that
>> >>>> are not sinks, these random transitions are added to all nodes in
>> >>>> the Web, with a residual probability usually set to d = 0.85,
>> >>>> estimated from the frequency that an average surfer uses his or
>> >>>> her browser's bookmark feature." [2]
>> >>>>
>> >>>> Now, if I want to try to account for the dangling nodes, I need to
>> >>>> sum all PageRank values for the dangling nodes and divide the sum
>> >>>> evenly among all other pages (which in practice means sending a
>> >>>> message to all vertices in Giraph... which is not possible, as it
>> >>>> would be crazy with large graphs).
>> >>>>
>> >>>> I imagine one can use a SumAggregator and alternate computing the
>> >>>> contribution from dangling nodes and the PageRank values (i.e.
>> >>>> PageRank values are computed only every 2 supersteps).
>> >>>>
>> >>>> What do you think?
>> >>>>
>> >>>> Paolo
>> >>>>
>> >>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>> >>>> Beyond: The Science of Search Engine Rankings" (2006)
>> >>>> [2] http://en.wikipedia.org/wiki/PageRank
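
To close the loop on the convergence check mentioned at the top of the
thread, here is a sketch of a slight variant: instead of comparing the
L1-norms of two successive PageRank vectors, each vertex contributes
|newRank - oldRank| to a sum aggregator, so the aggregated value is
directly the L1-norm of the difference between successive vectors and no
previous norm has to be kept around. aggregateDelta(), getDeltaSum() and
TOLERANCE are assumptions, and dangling-node handling is omitted here to
keep the focus on the convergence test:

--------
// Sketch: halt on convergence instead of a fixed MAX_SUPERSTEPS budget.
// The sum of the per-vertex |newRank - oldRank| contributions is the
// L1-norm of the difference between successive PageRank vectors.
private static final double TOLERANCE = 0.000001; // assumed threshold

public void compute(Iterator<DoubleWritable> msgIterator) {
    double oldRank = getVertexValue().get();
    double sumOfNeighborRanks = 0;
    while (msgIterator.hasNext()) {
        sumOfNeighborRanks += msgIterator.next().get();
    }
    double newRank = 0.15f / getNumVertices()
        + 0.85f * sumOfNeighborRanks;
    setVertexValue(new DoubleWritable(newRank));
    aggregateDelta(Math.abs(newRank - oldRank)); // assumed helper

    // getDeltaSum(): assumed helper returning the value aggregated in
    // the previous superstep.
    if (getSuperstep() > 1 && getDeltaSum() < TOLERANCE) {
        voteToHalt();
    } else if (getNumOutEdges() > 0) {
        sendMsgToAllEdges(new DoubleWritable(
            newRank / getNumOutEdges()));
    }
}
--------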