Hi Sebastian

Sebastian Schelter wrote:
> Accumulation should be possible with aggregators and the idea with
> applying preference vectors seems very interesting too.

Indeed.

But...

 - How big is the preference vector? (it's N elements, where N is the number of 
vertexes in your graph, right?)
 - How do we load it into the Aggregator? Can we do it just once at the
beginning?
 - How can a user determine the preference vector? (this is not something we
need to be too worried about, it's a user-land problem... see the rough
sketch after this list for how a preference vector would change the update)
 - ...
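
Just to make the preference vector point concrete, here is a minimal sketch
of the per-vertex update it would imply (untested, and every name here is
illustrative, nothing from the actual Giraph API):

--------
// p_v is the preference vector entry for the current vertex
// (entries are >= 0 and sum to 1); 0.85 is the usual damping factor
static double update(double sumOfNeighborRanks, double danglingSum,
                     double p_v, long numVertices, boolean strong) {
  // strongly preferential RWR: the dangling mass also follows p;
  // weakly preferential RWR: the dangling mass is divided equally
  double danglingShare = strong ? danglingSum * p_v
                                : danglingSum / numVertices;
  return 0.15 * p_v + 0.85 * (sumOfNeighborRanks + danglingShare);
}
--------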

So, I think dealing properly with dangling nodes is a much more important
first step.

> Maybe these should be implemented as additional features in GIRAPH-191?

Yeah, it would be good to have these capabilities in a RandomWalkVertex (as
you showed) and maybe use that in a 'proper' PageRankVertex implementation
(more configurable, dealing with dangling nodes and/or iterating until
convergence is reached...)
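
Something along these lines, maybe (a rough, untested sketch;
getDanglingAggregatorValue() and aggregateDanglingValue() are hypothetical
helpers hiding whatever the aggregator API actually looks like, while the
other methods are the ones from the SimplePageRankVertex snippet quoted
below; it assumes the value aggregated in one superstep is readable in the
next):

--------
public void compute(Iterator<DoubleWritable> msgIterator) {
  if (getSuperstep() >= 1) {
    double sum = 0;
    while (msgIterator.hasNext()) {
      sum += msgIterator.next().get();
    }
    // dangling mass aggregated by the dangling vertexes during the
    // previous superstep (hypothetical helper, see note above)
    double dangling = getDanglingAggregatorValue();
    setVertexValue(new DoubleWritable(
        0.15f / getNumVertices()
        + 0.85f * (sum + dangling / getNumVertices())));
  }
  if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    if (edges > 0) {
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      // dangling node: push the rank into the aggregator instead of
      // sending messages, so it can be redistributed next superstep
      aggregateDanglingValue(getVertexValue().get());
    }
  } else {
    voteToHalt();
  }
}
--------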

Paolo

> 
> On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and divide
>> it equally in the next iteration?
>> If you use a preference vector instead of dividing it equally, then you
>> have also the strongly preferential (and weakly preferential in case you
>> divide equally) versions of RWR.
>>
>> Cheers,
>> --
>> Gianmarco
>>
>>
>>
>>
>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <
>> castagna.li...@googlemail.com> wrote:
>>
>>> Sebastian Schelter wrote:
>>>> You are right, the code would throw an exception here, but I guess it
>>>> would be sufficient to simply add a check here and not send any messages.
>>> That would avoid throwing an exception, but I am not sure it is actually
>>> the best thing to do.
>>> Some suggest dividing the PageRank scores of dangling nodes evenly
>>> among all other pages. If you want, this is equivalent to another
>>> random jump, but one forced by the fact that there are no outgoing
>>> links to select from.
>>> Using your notation below:
>>>
>>>  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>>> sumOfDanglingNodesRanks / getNumVertices() )
>>>
>>> See also p. 38 of the book "Google's PageRank and Beyond":
>>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>>
>>> In other words, in pseudocode, a serial implementation which accounts
>>> for dangling nodes would be:
>>>
>>> --------
>>> dangling_nodes_ranks = 0
>>> foreach node in dangling_nodes
>>>     dangling_nodes_ranks += ranks.get ( node )
>>> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>>>
>>> foreach node1 in graph
>>>     r = 0
>>>     foreach node2 in incoming_links ( node1 )
>>>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>>     new_ranks.put ( node1,
>>>         damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N )
>>> --------
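>>>
>>> Or, the same thing in plain (untested) Java, with illustrative types:
>>> the graph is just a Map from a vertex id to its outgoing neighbours,
>>> an empty list marking a dangling node.
>>>
>>> --------
>>> // imports: java.util.HashMap, java.util.List, java.util.Map
>>> static Map<Long, Double> iterate(Map<Long, List<Long>> graph,
>>>                                  Map<Long, Double> ranks) {
>>>   final double d = 0.85;
>>>   final int n = graph.size();
>>>
>>>   // rank mass currently sitting on dangling nodes
>>>   double danglingMass = 0.0;
>>>   for (Map.Entry<Long, List<Long>> e : graph.entrySet()) {
>>>     if (e.getValue().isEmpty()) {
>>>       danglingMass += ranks.get(e.getKey());
>>>     }
>>>   }
>>>
>>>   // push each vertex's rank along its outgoing links
>>>   // (equivalent to pulling from incoming_links above)
>>>   Map<Long, Double> incoming = new HashMap<Long, Double>();
>>>   for (Long v : graph.keySet()) {
>>>     incoming.put(v, 0.0);
>>>   }
>>>   for (Map.Entry<Long, List<Long>> e : graph.entrySet()) {
>>>     List<Long> out = e.getValue();
>>>     if (!out.isEmpty()) {
>>>       double share = ranks.get(e.getKey()) / out.size();
>>>       for (Long target : out) {
>>>         incoming.put(target, incoming.get(target) + share);
>>>       }
>>>     }
>>>   }
>>>
>>>   // new rank: in-links + equal share of dangling mass + random jump
>>>   Map<Long, Double> newRanks = new HashMap<Long, Double>();
>>>   for (Long v : graph.keySet()) {
>>>     newRanks.put(v,
>>>         d * (incoming.get(v) + danglingMass / n) + (1 - d) / n);
>>>   }
>>>   return newRanks;
>>> }
>>> --------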
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 17.05.2012 23:44, Paolo Castagna wrote:
>>>>> Hi Sebastian,
>>>>> from SimplePageRankVertex.java, we have:
>>>>>
>>>>>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>>>>       long edges = getNumOutEdges();
>>>>>       sendMsgToAllEdges(
>>>>>           new DoubleWritable(getVertexValue().get() / edges));
>>>>>     } else {
>>>>>       voteToHalt();
>>>>>     }
>>>>>
>>>>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>>>>
>>>>> Paolo
>>>>>
>>>>> Sebastian Schelter wrote:
>>>>>> The implementation implicitly deals with dangling nodes by computing
>>>>>> the pageRank as
>>>>>>
>>>>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>>>>
>>>>>> Conceptually, this is the same as connecting every vertex with every
>>>>>> other vertex it is not yet connected to. This removes all dangling
>>>>>> nodes from the graph as every vertex can reach every other vertex.
>>>>>>
>>>>>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>>>>> this approach.
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>>
>>>>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>>>>
>>>>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>>>>> Hi,
>>>>>>> I noticed that the SimplePageRankVertex implementation does not deal
>>>>>>> with dangling nodes (dangling nodes are those nodes with no outgoing
>>>>>>> links). And, probably, that is one of the good reasons why there is
>>>>>>> a "Simple" in front. ;-)
>>>>>>>
>>>>>>>  "Dangling nodes exist in many forms. For example, a page of data, a
>>>>>>> page with a postscript graph, [...], a page that has been fetched by
>>>>>>> a crawler but not yet explored - these are all examples of possible
>>>>>>> dangling nodes. The more ambitious the crawl, the bigger the
>>>>>>> proportion of dangling nodes because the set of fetched but
>>>>>>> uncrawled pages grows quickly." (p. 80) [1]
>>>>>>>
>>>>>>> Wikipedia's PageRank page says:
>>>>>>>
>>>>>>>  "If a page has no links to other pages, it becomes a sink and
>>>>>>> therefore terminates the random surfing process. If the random
>>>>>>> surfer arrives at a sink page, it picks another URL at random and
>>>>>>> continues surfing again. When calculating PageRank, pages with no
>>>>>>> outbound links are assumed to link out to all other pages in the
>>>>>>> collection. Their PageRank scores are therefore divided evenly among
>>>>>>> all other pages. In other words, to be fair with pages that are not
>>>>>>> sinks, these random transitions are added to all nodes in the Web,
>>>>>>> with a residual probability usually set to d = 0.85, estimated from
>>>>>>> the frequency that an average surfer uses his or her browser's
>>>>>>> bookmark feature." [2]
>>>>>>>
>>>>>>> Now, if I want to try to account for the dangling nodes I need to
>>>>>>> sum all PageRank values for the dangling nodes and divide the sum
>>>>>>> evenly among all other pages (which in practice means sending a
>>>>>>> message to all vertexes in Giraph... which is not possible, as it
>>>>>>> would be crazy with large graphs).
>>>>>>>
>>>>>>> I imagine one can use a SumAggregator and alternate computing the
>>>>>>> contribution from dangling nodes and the PageRank values (i.e.
>>>>>>> PageRank values are computed only every 2 supersteps).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Paolo
>>>>>>>
>>>>>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>>>>>>> Beyond: The Science of Search Engine Rankings" (2006)
>>>>>>>  [2] http://en.wikipedia.org/wiki/PageRank
> 
