Hi Sebastian

Sebastian Schelter wrote:
> Very good points, could you add them to GIRAPH-191?
Done. Summarized with a link to this thread.

> I think that the convergence check does not need an extra superstep. It
> is sufficient to compute the L1-norm of the current pageRank vector,
> which can be done by an aggregator and compare it to the previous
> aggregated value.

Right. Much better. :-)

> I guess that summing up and redistributing the pagerank of dangling
> vertices can also be done without an extra superstep in an aggregator.

Yeah! Why didn't I think of that? Thanks, great suggestion.

I am going to give this a go, at first without extending RandomWalkVertex,
since I want to see how it might work. This would also inform the design
and improvement of the current RandomWalkVertex.
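To make it concrete, here is roughly what I have in mind for the compute()
method (completely untested, just a sketch: I am keeping the structure and
MAX_SUPERSTEPS of SimplePageRankVertex, and hiding the aggregator plumbing
behind made-up helper methods, aggregateDanglingRank(),
getDanglingRankSum(), aggregateL1Norm() and converged(), until I check
what the aggregator API actually offers):

public void compute(Iterator<DoubleWritable> msgIterator) {
  if (getSuperstep() > 0) {
    double sumOfNeighborRanks = 0;
    while (msgIterator.hasNext()) {
      sumOfNeighborRanks += msgIterator.next().get();
    }
    // Rank mass aggregated by the dangling nodes during the previous
    // superstep, redistributed evenly among all vertices.
    double dangling = getDanglingRankSum() / getNumVertices();
    double rank = 0.15f / getNumVertices()
        + 0.85f * (sumOfNeighborRanks + dangling);
    // The sum of the new ranks over all vertices is the L1-norm of the
    // current PageRank vector; converged() would compare it with the
    // value aggregated in the previous superstep.
    aggregateL1Norm(rank);
    setVertexValue(new DoubleWritable(rank));
  }
  if (getSuperstep() < MAX_SUPERSTEPS && !converged()) {
    if (getNumOutEdges() > 0) {
      sendMsgToAllEdges(
          new DoubleWritable(getVertexValue().get() / getNumOutEdges()));
    } else {
      // Dangling node: nowhere to send messages, so aggregate the rank
      // instead, to be redistributed in the next superstep.
      aggregateDanglingRank(getVertexValue().get());
    }
  } else {
    voteToHalt();
  }
}

This way both the dangling nodes' contribution and the convergence check
piggyback on the normal supersteps, so no extra supersteps should be
needed.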
Thanks for your comments and suggestions.

Paolo

> --sebastian
>
> On 18.05.2012 12:31, Paolo Castagna wrote:
>> Hi Gianmarco
>>
>> Gianmarco De Francisci Morales wrote:
>>> Could you just accumulate the PR of dangling nodes in a variable and
>>> divide it equally in the next iteration?
>>
>> Yep, this is exactly what I was thinking when I wrote: "I imagine one
>> can use a SumAggregator and alternate computing the contribution from
>> dangling nodes and the PageRank values (i.e. PageRank values are
>> computed only every 2 supersteps)."
>>
>> You need to use an Aggregator and spend a superstep just to
>> aggregate/sum the PageRank of all dangling nodes.
>> Then, in the next superstep, you can divide that by the number of
>> nodes in the graph and distribute it equally among all other nodes
>> (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
>>
>> Another change/enhancement could be to check for convergence (to do
>> that, you need to keep current and previous PageRank values) and
>> spend another superstep just doing that.
>> But, as you can see, things stop being simple, and the point of
>> SimplePageRankVertex is to offer a very simple example.
>> I did not start this thread with the aim of changing
>> SimplePageRankVertex; I think it should stay as it is (and as
>> Google's Pregel paper describes it).
>>
>> However, in practice, if you want to run PageRank or similar, you
>> need to deal with problems such as: what to do with dangling
>> nodes/pages? How to know how many iterations you need to reach the
>> precision you want/need? Etc.
>>
>> My initial message was to exchange some ideas and get some feedback
>> and/or suggestions on how a not too simple PageRankVertex might
>> look. ;-)
>>
>>> If you use a preference vector instead of dividing it equally, then
>>> you also have the strongly preferential (and weakly preferential, in
>>> case you divide equally) versions of RWR.
>>
>> Yep, but how do you get your preference vector? ;-)
>>
>> As a first step, it would be great if we could come up with a
>> PageRankVertex which takes into account dangling nodes and
>> perhaps/optionally checks for convergence every few supersteps.
>>
>> Anyway, thanks for your comments and feedback Gianmarco.
>>
>> Paolo
>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>>> <castagna.li...@googlemail.com> wrote:
>>>
>>> Sebastian Schelter wrote:
>>> > You are right, the code would throw an exception here, but I guess
>>> > it would be sufficient to simply add a check here and not send any
>>> > messages.
>>>
>>> That would avoid throwing an exception, but I am not sure it is
>>> actually the best thing to do.
>>> Some suggest to divide the PageRank scores of dangling nodes evenly
>>> among all other pages.
>>> If you want, this is equivalent to another random jump, but forced
>>> by the fact that there are no outgoing links to select from.
>>> Using your notation below:
>>>
>>> 0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>>> sumOfDanglingNodesRanks / getNumVertices() )
>>>
>>> See also page 38 of the book "Google's PageRank and Beyond":
>>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>>
>>> In other words, in pseudocode, a serial implementation which
>>> accounts for dangling nodes would be:
>>>
>>> --------
>>> dangling_nodes_rank = 0
>>> foreach node in dangling_nodes
>>>     dangling_nodes_rank += ranks.get ( node )
>>> dangling_nodes_rank = ( damping_factor * dangling_nodes_rank ) / N
>>>
>>> foreach node1 in graph
>>>     r = 0
>>>     foreach node2 in incoming_links ( node1 )
>>>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>>     r = damping_factor * r + dangling_nodes_rank +
>>>         ( 1 - damping_factor ) / N
>>>     new_ranks.put ( node1, r )
>>> --------
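>>> In Java, the same iteration might look like the following (just a
>>> sketch to pin down the semantics, using plain java.util collections
>>> rather than Giraph; the method and parameter names are made up):
>>>
>>> static Map<Long, Double> iterate(Map<Long, Double> ranks,
>>>     Map<Long, List<Long>> incomingLinks,
>>>     Map<Long, Integer> outDegree, double dampingFactor) {
>>>   int n = ranks.size();
>>>   // Sum the rank mass sitting on the dangling nodes
>>>   // (i.e. the nodes with out-degree 0)...
>>>   double danglingNodesRank = 0;
>>>   for (Map.Entry<Long, Double> entry : ranks.entrySet()) {
>>>     if (outDegree.get(entry.getKey()) == 0) {
>>>       danglingNodesRank += entry.getValue();
>>>     }
>>>   }
>>>   // ... and prepare to spread it evenly over all N nodes.
>>>   danglingNodesRank = (dampingFactor * danglingNodesRank) / n;
>>>   Map<Long, Double> newRanks = new HashMap<Long, Double>();
>>>   for (Long node : ranks.keySet()) {
>>>     double r = 0;
>>>     for (Long source : incomingLinks.get(node)) {
>>>       r += ranks.get(source) / outDegree.get(source);
>>>     }
>>>     r = dampingFactor * r + danglingNodesRank
>>>         + (1 - dampingFactor) / n;
>>>     newRanks.put(node, r);
>>>   }
>>>   return newRanks;
>>> }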
>>> What do you think?
>>>
>>> Paolo
>>>
>>> > --sebastian
>>> >
>>> > On 17.05.2012 23:44, Paolo Castagna wrote:
>>> >> Hi Sebastian,
>>> >> from SimplePageRankVertex.java, we have:
>>> >>
>>> >> if (getSuperstep() < MAX_SUPERSTEPS) {
>>> >>     long edges = getNumOutEdges();
>>> >>     sendMsgToAllEdges(
>>> >>         new DoubleWritable(getVertexValue().get() / edges));
>>> >> } else {
>>> >>     voteToHalt();
>>> >> }
>>> >>
>>> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>>> >> node)?
>>> >>
>>> >> Paolo
>>> >>
>>> >> Sebastian Schelter wrote:
>>> >>> The implementation implicitly deals with dangling nodes by
>>> >>> computing the pageRank as
>>> >>>
>>> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>> >>>
>>> >>> Conceptually, this is the same as connecting every vertex with
>>> >>> every other vertex it is not yet connected to. This removes all
>>> >>> dangling nodes from the graph as every vertex can reach every
>>> >>> other vertex.
>>> >>>
>>> >>> "Mining of Massive Datasets" [1] has a very well written
>>> >>> explanation of this approach.
>>> >>>
>>> >>> --sebastian
>>> >>>
>>> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>> >>>
>>> >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>> >>>> Hi,
>>> >>>> I noticed that the SimplePageRankVertex implementation does not
>>> >>>> deal with dangling nodes (dangling nodes are those nodes with no
>>> >>>> outgoing links).
>>> >>>> And, probably, that is one of the good reasons why there is a
>>> >>>> "Simple" in front. ;-)
>>> >>>>
>>> >>>> "Dangling nodes exist in many forms. For example, a page of
>>> >>>> data, a page with a postscript graph, [...], a page that has
>>> >>>> been fetched by a crawler but not yet explored - these are all
>>> >>>> examples of possible dangling nodes. The more ambitious the
>>> >>>> crawl, the bigger the proportion of dangling nodes because the
>>> >>>> set of fetched but uncrawled pages grows quickly." (page 80) [1]
>>> >>>>
>>> >>>> Wikipedia's PageRank page says:
>>> >>>>
>>> >>>> "If a page has no links to other pages, it becomes a sink and
>>> >>>> therefore terminates the random surfing process. If the random
>>> >>>> surfer arrives at a sink page, it picks another URL at random
>>> >>>> and continues surfing again. When calculating PageRank, pages
>>> >>>> with no outbound links are assumed to link out to all other
>>> >>>> pages in the collection. Their PageRank scores are therefore
>>> >>>> divided evenly among all other pages. In other words, to be fair
>>> >>>> with pages that are not sinks, these random transitions are
>>> >>>> added to all nodes in the Web, with a residual probability
>>> >>>> usually set to d = 0.85, estimated from the frequency that an
>>> >>>> average surfer uses his or her browser's bookmark feature." [2]
>>> >>>>
>>> >>>> Now, if I want to try to account for the dangling nodes, I need
>>> >>>> to sum all the PageRank values of the dangling nodes and divide
>>> >>>> the total evenly among all other pages (which in practice means
>>> >>>> sending a message to all vertices in Giraph... which is not
>>> >>>> possible, as it would be crazy with large graphs).
>>> >>>>
>>> >>>> I imagine one can use a SumAggregator and alternate computing
>>> >>>> the contribution from dangling nodes and the PageRank values
>>> >>>> (i.e. PageRank values are computed only every 2 supersteps).
>>> >>>>
>>> >>>> What do you think?
>>> >>>>
>>> >>>> Paolo
>>> >>>>
>>> >>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>>> >>>> Beyond: The Science of Search Engine Rankings" (2006)
>>> >>>> [2] http://en.wikipedia.org/wiki/PageRank