Hi,
I noticed that the SimplePageRankVertex implementation does not deal with
dangling nodes (dangling nodes are those nodes with no outgoing links).
And, probably that is one of the good reasons why there is a "Simple" in front. 
;-)

 "Dangling nodes exist in many forms. For example, a page of data, a page with a
postscript graph, [...], a page that has been fetched by a crawler but not yet
explored - these are all examples of possible dangling nodes. The more ambitious
the crawl, the bigger the proportion of dangling nodes because the set of
fetched but uncrawled pages grows quickly." (pag. 80) [1]

Wikipedia's PageRank page says:

 "If a page has no links to other pages, it becomes a sink and therefore
terminates the random surfing process. If the random surfer arrives at a sink
page, it picks another URL at random and continues surfing again. When
calculating PageRank, pages with no outbound links are assumed to link out to
all other pages in the collection. Their PageRank scores are therefore divided
evenly among all other pages. In other words, to be fair with pages that are not
sinks, these random transitions are added to all nodes in the Web, with a
residual probability usually set to d = 0.85, estimated from the frequency that
an average surfer uses his or her browser's bookmark feature." [2]

Now, if I want to try to account for the dangling nodes I need to sum all
PageRank values for the dangling nodes and divide it evenly among all other
pages (which in practice means to send a message to all vertexes in Giraph...
which is not possible, as it would be crazy with large graphs).

I imagine one can use a SumAggregator and alternate computing contribution from
dangling nodes and PageRank values (i.e. PageRank values are computed only every
2 supersteps).

What do you think?

Paolo

 [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
Science of Search Engine Rankings" (2006)
 [2] http://en.wikipedia.org/wiki/PageRank

Reply via email to