Could you just accumulate the PR of the dangling nodes in a variable and
divide it equally in the next iteration?
If you use a preference vector instead of dividing it equally, you also get
the strongly preferential version of RWR (and the weakly preferential one
when you divide equally).
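
Something along these lines, as a rough sketch on top of the
SimplePageRankVertex code quoted below (the "dangling" aggregator name and
the aggregate() / getAggregatedValue() accessors are assumptions, and the
aggregator registration is omitted):

--------
@Override
public void compute(Iterator<DoubleWritable> msgIterator) {
  if (getSuperstep() >= 1) {
    // Dangling mass accumulated by the aggregator in the previous superstep.
    double dangling =
        ((DoubleWritable) getAggregatedValue("dangling")).get();
    double sum = 0;
    while (msgIterator.hasNext()) {
      sum += msgIterator.next().get();
    }
    // Weakly preferential: the dangling mass is divided equally among all
    // vertices. For the strongly preferential variant, replace
    // 1 / getNumVertices() with this vertex's entry in the preference vector.
    setVertexValue(new DoubleWritable(0.15f / getNumVertices()
        + 0.85f * (sum + dangling / getNumVertices())));
  }
  if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    if (edges > 0) {
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      // Dangling node: accumulate its rank instead of sending messages.
      aggregate("dangling", new DoubleWritable(getVertexValue().get()));
    }
  } else {
    voteToHalt();
  }
}
--------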

Cheers,
--
Gianmarco

On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
<castagna.li...@googlemail.com> wrote:

> Sebastian Schelter wrote:
> > You are right, the code would throw an exception here, but I guess it
> > would be sufficient to simply add a check and not send any messages.
>
> That would avoid throwing an exception, but I am not sure it is actually
> the best thing to do.
> Some suggest dividing the PageRank scores of dangling nodes evenly among
> all other pages.
> In a sense, this is equivalent to another random jump, but forced by the
> fact that there are no outgoing links to select from.
> Using your notation below:
>
>  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>  sumOfDanglingNodesRanks / getNumVertices() )
>
> See also page 38 of the book "Google's PageRank and Beyond":
> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>
> In other words, in pseudo code, a serial implementation which accounts
> for dangling nodes would be:
>
> --------
> dangling_nodes_ranks = 0
> foreach node in dangling_nodes
>     dangling_nodes_ranks += ranks.get ( node )
> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>
> foreach node1 in graph
>     r = 0
>     foreach node2 in incoming_links ( node1 )
>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>     new_ranks.put ( node1, damping_factor * r + dangling_nodes_ranks +
>                     ( 1 - damping_factor ) / N )
> --------
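>
> To be concrete, here is the same idea as a small self-contained Java
> sketch (all the names are mine; the graph is represented with adjacency
> lists):
>
> --------
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> public class DanglingAwarePageRank {
>
>   /** One PageRank iteration which accounts for dangling nodes. */
>   static Map<Long, Double> iterate(Map<Long, List<Long>> outgoingLinks,
>                                    Map<Long, List<Long>> incomingLinks,
>                                    Map<Long, Double> ranks,
>                                    double dampingFactor) {
>     int n = ranks.size();
>
>     // Sum the ranks of the dangling nodes; the damped share of this mass
>     // is divided evenly among all nodes.
>     double danglingNodesRanks = 0;
>     for (Map.Entry<Long, List<Long>> e : outgoingLinks.entrySet()) {
>       if (e.getValue().isEmpty()) {
>         danglingNodesRanks += ranks.get(e.getKey());
>       }
>     }
>     danglingNodesRanks = (dampingFactor * danglingNodesRanks) / n;
>
>     Map<Long, Double> newRanks = new HashMap<Long, Double>();
>     for (Long node1 : ranks.keySet()) {
>       double r = 0;
>       for (Long node2 : incomingLinks.get(node1)) {
>         r += ranks.get(node2) / outgoingLinks.get(node2).size();
>       }
>       newRanks.put(node1, dampingFactor * r + danglingNodesRanks
>           + (1 - dampingFactor) / n);
>     }
>     return newRanks;
>   }
> }
> --------
>
> Note that the dangling share is pre-multiplied by the damping factor, so
> the per-node update only has to add it.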
>
> What do you think?
>
> Paolo
>
> >
> > --sebastian
> >
> >
> > On 17.05.2012 23:44, Paolo Castagna wrote:
> >> Hi Sebastian,
> >> from SimplePageRankVertex.java, we have:
> >>
> >>     if (getSuperstep() < MAX_SUPERSTEPS) {
> >>       long edges = getNumOutEdges();
> >>       sendMsgToAllEdges(
> >>           new DoubleWritable(getVertexValue().get() / edges));
> >>     } else {
> >>       voteToHalt();
> >>     }
> >>
> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
> >>
> >> Paolo
> >>
> >> Sebastian Schelter wrote:
> >>> The implementation implicitly deals with dangling nodes by computing
> >>> the PageRank as
> >>>
> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
> >>>
> >>> Conceptually, this is the same as connecting every vertex with every
> >>> other vertex it is not yet connected to. This removes all dangling
> >>> nodes from the graph, as every vertex can reach every other vertex.
> >>>
> >>> "Mining of Massive Datasets" [1] has a very well written explanation of
> >>> this approach.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
> >>>
> >>> On 17.05.2012 16:23, Paolo Castagna wrote:
> >>>> Hi,
> >>>> I noticed that the SimplePageRankVertex implementation does not deal
> >>>> with dangling nodes (dangling nodes are those nodes with no outgoing
> >>>> links).
> >>>> And that is probably one of the good reasons why there is a "Simple"
> >>>> in front. ;-)
> >>>>
> >>>>  "Dangling nodes exist in many forms. For example, a page of data, a
> >>>>  page with a postscript graph, [...], a page that has been fetched by
> >>>>  a crawler but not yet explored - these are all examples of possible
> >>>>  dangling nodes. The more ambitious the crawl, the bigger the
> >>>>  proportion of dangling nodes because the set of fetched but
> >>>>  uncrawled pages grows quickly." (page 80) [1]
> >>>>
> >>>> Wikipedia's PageRank page says:
> >>>>
> >>>>  "If a page has no links to other pages, it becomes a sink and
> >>>>  therefore terminates the random surfing process. If the random
> >>>>  surfer arrives at a sink page, it picks another URL at random and
> >>>>  continues surfing again. When calculating PageRank, pages with no
> >>>>  outbound links are assumed to link out to all other pages in the
> >>>>  collection. Their PageRank scores are therefore divided evenly among
> >>>>  all other pages. In other words, to be fair with pages that are not
> >>>>  sinks, these random transitions are added to all nodes in the Web,
> >>>>  with a residual probability usually set to d = 0.85, estimated from
> >>>>  the frequency that an average surfer uses his or her browser's
> >>>>  bookmark feature." [2]
> >>>>
> >>>> Now, if I want to account for the dangling nodes, I need to sum the
> >>>> PageRank values of all the dangling nodes and divide that sum evenly
> >>>> among all other pages (which in practice means sending a message to
> >>>> every vertex in Giraph... which is not practical, as it would be far
> >>>> too expensive with large graphs).
> >>>>
> >>>> I imagine one can use a SumAggregator and alternate between computing
> >>>> the contribution from the dangling nodes and the PageRank values
> >>>> (i.e. PageRank values are updated only every 2 supersteps), as in the
> >>>> rough sketch below.
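> >>>>
> >>>> Something like the following sketch (the aggregator accessors and the
> >>>> "danglingSum" name are hypothetical):
> >>>>
> >>>> --------
> >>>> if (getSuperstep() % 2 == 0) {
> >>>>   // Even supersteps: dangling vertices push their rank into a shared
> >>>>   // sum aggregator, all other vertices forward their rank as usual.
> >>>>   if (getNumOutEdges() == 0) {
> >>>>     aggregate("danglingSum", new DoubleWritable(getVertexValue().get()));
> >>>>   } else {
> >>>>     sendMsgToAllEdges(new DoubleWritable(
> >>>>         getVertexValue().get() / getNumOutEdges()));
> >>>>   }
> >>>> } else {
> >>>>   // Odd supersteps: the dangling mass aggregated in the previous
> >>>>   // superstep is now available, so do the rank update with its share.
> >>>>   double sum = 0;
> >>>>   while (msgIterator.hasNext()) {
> >>>>     sum += msgIterator.next().get();
> >>>>   }
> >>>>   double dangling =
> >>>>       ((DoubleWritable) getAggregatedValue("danglingSum")).get();
> >>>>   setVertexValue(new DoubleWritable(0.15f / getNumVertices()
> >>>>       + 0.85f * (sum + dangling / getNumVertices())));
> >>>> }
> >>>> --------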
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Paolo
> >>>>
> >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
> >>>>  Beyond: The Science of Search Engine Rankings" (2006)
> >>>>  [2] http://en.wikipedia.org/wiki/PageRank
> >
>
