[ 
https://issues.apache.org/jira/browse/FLINK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619509#comment-14619509
 ] 

ASF GitHub Bot commented on FLINK-2310:
---------------------------------------

Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/892#issuecomment-119750756
  
    Hi @shghatge,
    thank you for working on this! It's not a trivial problem :)
    
    I have two general comments:
    
    - I would prefer to see this as library method instead of an example. As an 
example, it shows how to use neighborhood methods, joinWithVertices and 
getTriplets, but we already have the JaccardSimilarity for all of these.
    
    - The algorithm is actually very hard to scale. It essentially requires 
common neighborhood calculation. This is a real pain for large graphs with 
skewed vertices. @andralungu can explain to you more, since she's been working 
on optimizing similar problems for her work.
    An additional problem that this particular implementation has is that it 
also stores and sends vertex values, not only vertex IDs. If you try to run 
this with a big graph, it will probably crush or run forever.
    
    There are 2 things we could do:
    
    1). Only send a list of vertex IDs and let the receiving vertex generate 
partial sums for each common neighbor. At the end, we can aggregate the sums 
per edge. For example, consider the edges (1, 2), (1, 3), (2, 3), (1, 4) and 
(2, 4). In the neighborhood method, vertex 3 will find that it has a common 
neighbors with 1, vertex 2. Then, it will emit a tuple `<1, 2, 1/log(d3)>`. In 
the same way, vertex 4 will find that it has a common neighbor with 1, vertex 2 
and it will emit a tuple `<1, 2, 1/log(d4)>`.
    The output of the neighborhood method will be a dataset of tuples that 
represent edges with partial sums. You can then compute the Adamic-Adar 
similarity by summing up the values per vertex.
    
    2). The above solution will be OK for regular graphs, but will still have 
problems in the presence of skew. Another approach would be to offer an 
approximate Adamic-Adar similarity, by representing neighborhoods as bloom 
filters. In this case, the size of common neighborhoods will be over-estimated, 
but the computation will be faster.
    
    @shghatge, @andralungu, let me know what you think and how you would like 
to proceed! 


> Add an Adamic-Adar Similarity example
> -------------------------------------
>
>                 Key: FLINK-2310
>                 URL: https://issues.apache.org/jira/browse/FLINK-2310
>             Project: Flink
>          Issue Type: Task
>          Components: Gelly
>            Reporter: Andra Lungu
>            Assignee: Shivani Ghatge
>            Priority: Minor
>
> Just as Jaccard, the Adamic-Adar algorithm measures the similarity between a 
> set of nodes. However, instead of counting the common neighbors and dividing 
> them by the total number of neighbors, the similarity is weighted according 
> to the vertex degrees. In particular, it's equal to log(1/numberOfEdges).
> The Adamic-Adar algorithm can be broken into three steps: 
> 1). For each vertex, compute the log of its inverse degrees (with the formula 
> above) and set it as the vertex value. 
> 2). Each vertex will then send this new computed value along with a list of 
> neighbors to the targets of its out-edges
> 3). Weigh the edges with the Adamic-Adar index: Sum over n from CN of 
> log(1/k_n)(CN is the set of all common neighbors of two vertices x, y. k_n is 
> the degree of node n). See [2]
> Prerequisites: 
> - Full understanding of the Jaccard Similarity Measure algorithm
> - Reading the associated literature: 
> [1] http://social.cs.uiuc.edu/class/cs591kgk/friendsadamic.pdf
> [2] 
> http://stackoverflow.com/questions/22565620/fast-algorithm-to-compute-adamic-adar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to