Hi Paolo,

Very good points, could you add them to GIRAPH-191?

I think that the convergence check does not need an extra superstep. It is
sufficient to compute the L1-norm of the current pageRank vector, which can
be done with an aggregator, and compare it to the previously aggregated value.
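Something like the following sketch, inside the vertex's compute() (the
aggregator name "pagerank.l1norm" and the getAggregator()/aggregate() calls
are only my assumption of how the aggregator would be wired up; it would
also have to be registered and used somewhere):

--------
// after the new value has been computed as in SimplePageRankVertex
double newValue = 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks;
setVertexValue(new DoubleWritable(newValue));

// every vertex adds its (non-negative) value to a sum aggregator, so the
// aggregated result is the L1-norm of the current pageRank vector
getAggregator("pagerank.l1norm").aggregate(new DoubleWritable(newValue));

// in the next superstep the aggregated norm of this iteration becomes
// visible; comparing it to the norm kept from the previous iteration
// tells us whether the vector is still changing, with no extra superstep
--------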

I guess that summing up and redistributing the pageRank of the dangling
vertices can also be done without an extra superstep, using an aggregator.
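For example (again just a sketch, loosely following SimplePageRankVertex's
compute(); the "dangling.sum" aggregator name and its wiring are my
assumption, and the redistributed mass is always one superstep old, which I
would not expect to hurt convergence much):

--------
// msgIterator is the Iterator<DoubleWritable> passed to compute();
// sum the messages received from the real in-neighbours
double sum = 0;
while (msgIterator.hasNext()) {
  sum += msgIterator.next().get();
}

// add this vertex's share of the dangling mass aggregated in the
// previous superstep
DoubleWritable dangling =
    (DoubleWritable) getAggregator("dangling.sum").getAggregatedValue();
sum += dangling.get() / getNumVertices();

double newValue = 0.15f / getNumVertices() + 0.85f * sum;
setVertexValue(new DoubleWritable(newValue));

long edges = getNumOutEdges();
if (edges > 0) {
  sendMsgToAllEdges(new DoubleWritable(newValue / edges));
} else {
  // dangling vertex: no outgoing edges, so hand the value to the
  // aggregator instead of sending messages
  getAggregator("dangling.sum").aggregate(new DoubleWritable(newValue));
}
--------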

--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
> Hi Gianmarco
> 
> Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and
>> divide it equally in the next iteration?
> 
> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use 
> a SumAggregator and alternate computing contribution from dangling nodes and 
> PageRank values (i.e. PageRank values are
> computed only every 2 supersteps)."
> 
> You need to use an Aggregator and spend a superstep just to aggregate/sum the 
> PageRank of all dangling nodes.
> Then, in the next superstep, you can take that sum divided by the number of 
> nodes in the graph and distribute it equally among all other nodes (i.e. the 
> sumOfDanglingNodesRanks / getNumVertices() below).
> 
> Another change/enhancement could be to check for convergence (to do that, 
> you need to keep the current and previous PageRank values) and spend another 
> superstep just doing that.
> But, as you can see, things stop being simple, and the point of 
> SimplePageRankVertex is to offer a very simple example.
> I did not start this thread with the aim of changing SimplePageRankVertex; I 
> think it should stay as it is (and as Google's Pregel paper describes it).
> 
> However, in practice, if you want to run PageRank or similar, you need to 
> deal with problems such as: what to do with dangling nodes/pages? How do you 
> know how many iterations you need to reach the precision you want/need? Etc.
> 
> My initial message was to exchange some ideas and get some feedback and/or 
> suggestions on how a not too simple PageRankVertex might look. ;-)
> 
>> If you use a preference vector instead of dividing it equally, then you
>> also have the strongly preferential (and, in case you divide equally, the
>> weakly preferential) version of RWR.
> 
> Yep, but how do you get your preference vector? ;-)
> 
> As a first step, it would be great if we could come up with a PageRankVertex 
> which takes dangling nodes into account and, perhaps optionally, checks for 
> convergence every few supersteps.
> 
> Anyway, thanks for your comments and feedback Gianmarco.
> 
> Paolo
> 
>>
>> Cheers,
>> --
>> Gianmarco
>>
>>
>>
>>
>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>> <castagna.li...@googlemail.com>
>> wrote:
>>
>>     Sebastian Schelter wrote:
>>     > You are right, the code would throw an exception here, but I guess it
>>     > would be sufficient to simply add a check here and not send any
>>     > messages.
>>
>>     That would avoid throwing an exception, but I am not sure it is
>>     actually the best thing to do.
>>     Some suggest dividing the PageRank scores of dangling nodes evenly
>>     among all other pages.
>>     In a way, this is equivalent to another random jump, one forced
>>     by the fact that there are no outgoing links to select from.
>>     Using your notation below:
>>
>>      0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>>     sumOfDanglingNodesRanks / getNumVertices() )
>>
>>     See also page 38 of the book "Google's PageRank and Beyond":
>>     http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>
>>     In other words, in pseudocode, a serial implementation which
>>     accounts for dangling nodes would be:
>>
>>     --------
>>     dangling_nodes_ranks = 0
>>     foreach node in dangling_nodes
>>        dangling_nodes_ranks += ranks.get ( node )
>>     dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>>
>>     foreach node1 in graph
>>        r = 0
>>        foreach node2 in incoming_links ( node1 )
>>            r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>        new_ranks.put ( node1,
>>            damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N )
>>     --------
>>
>>     What do you think?
>>
>>     Paolo
>>
>>     >
>>     > --sebastian
>>     >
>>     >
>>     > On 17.05.2012 23:44, Paolo Castagna wrote:
>>     >> Hi Sebastian,
>>     >> from SimplePageRankVertex.java, we have:
>>     >>
>>     >>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>     >>       long edges = getNumOutEdges();
>>     >>       sendMsgToAllEdges(
>>     >>           new DoubleWritable(getVertexValue().get() / edges));
>>     >>     } else {
>>     >>       voteToHalt();
>>     >>     }
>>     >>
>>     >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>>     >> node)?
>>     >>
>>     >> Paolo
>>     >>
>>     >> Sebastian Schelter wrote:
>>     >>> The implementation implicitly deals with dangling nodes by computing
>>     >>> the pageRank as
>>     >>>
>>     >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>     >>>
>>     >>> Conceptually, this is the same as connecting every vertex with every
>>     >>> other vertex it is not yet connected to. This removes all dangling
>>     >>> nodes from the graph as every vertex can reach every other vertex.
>>     >>>
>>     >>> "Mining of Massive Datasets" [1] has a very well written explanation
>>     >>> of this approach.
>>     >>>
>>     >>> --sebastian
>>     >>>
>>     >>>
>>     >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>     >>>
>>     >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>     >>>> Hi,
>>     >>>> I noticed that the SimplePageRankVertex implementation does not deal
>>     >>>> with dangling nodes (dangling nodes are those nodes with no outgoing
>>     >>>> links).
>>     >>>> And that is probably one of the good reasons why there is a "Simple"
>>     >>>> in front. ;-)
>>     >>>>
>>     >>>>  "Dangling nodes exist in many forms. For example, a page of data, a
>>     >>>> page with a postscript graph, [...], a page that has been fetched by
>>     >>>> a crawler but not yet explored - these are all examples of possible
>>     >>>> dangling nodes. The more ambitious the crawl, the bigger the
>>     >>>> proportion of dangling nodes because the set of fetched but
>>     >>>> uncrawled pages grows quickly." (page 80) [1]
>>     >>>>
>>     >>>> Wikipedia's PageRank page says:
>>     >>>>
>>     >>>>  "If a page has no links to other pages, it becomes a sink and
>>     therefore
>>     >>>> terminates the random surfing process. If the random surfer
>>     arrives at a sink
>>     >>>> page, it picks another URL at random and continues surfing
>>     again. When
>>     >>>> calculating PageRank, pages with no outbound links are assumed
>>     to link out to
>>     >>>> all other pages in the collection. Their PageRank scores are
>>     therefore divided
>>     >>>> evenly among all other pages. In other words, to be fair with
>>     pages that are not
>>     >>>> sinks, these random transitions are added to all nodes in the
>>     Web, with a
>>     >>>> residual probability usually set to d = 0.85, estimated from
>>     the frequency that
>>     >>>> an average surfer uses his or her browser's bookmark feature." [2]
>>     >>>>
>>     >>>> Now, if I want to try to account for the dangling nodes I need to
>>     >>>> sum all PageRank values for the dangling nodes and divide the sum
>>     >>>> evenly among all other pages (which in practice means sending a
>>     >>>> message to all vertices in Giraph... which is not possible, as it
>>     >>>> would be crazy with large graphs).
>>     >>>>
>>     >>>> I imagine one can use a SumAggregator and alternate computing
>>     >>>> contribution from dangling nodes and PageRank values (i.e. PageRank
>>     >>>> values are computed only every 2 supersteps).
>>     >>>>
>>     >>>> What do you think?
>>     >>>>
>>     >>>> Paolo
>>     >>>>
>>     >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>>     >>>> Beyond: The Science of Search Engine Rankings" (2006)
>>     >>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>     >
>>
>>
