[
https://issues.apache.org/jira/browse/FLINK-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293438#comment-15293438
]
ASF GitHub Bot commented on FLINK-3780:
---------------------------------------
Github user vasia commented on a diff in the pull request:
https://github.com/apache/flink/pull/1980#discussion_r64049343
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2055,22 +2055,22 @@ vertex and edge in the output graph stores the
common group value and the number
### Jaccard Index
#### Overview
-The Jaccard Index measures the similarity between vertex neighborhoods.
Scores range from 0.0 (no common neighbors) to
-1.0 (all neighbors are common).
+The Jaccard Index measures the similarity between vertex neighborhoods and
is computed as the number of shared numbers
+divided by the number of distinct neighbors. Scores range from 0.0 (no
shared neighbors) to 1.0 (all neighbors are
+shared).
#### Details
-Counting common neighbors for pairs of vertices is equivalent to counting
the two-paths consisting of two edges
-connecting the two vertices to the common neighbor. The number of distinct
neighbors for pairs of vertices is computed
-by storing the sum of degrees of the vertex pair and subtracting the count
of common neighbors, which are double-counted
-in the sum of degrees.
+Counting shared neighbors for pairs of vertices is equivalent to counting
connecting paths of length two. The number of
+distinct neighbors is computed by storing the sum of degrees of the vertex
pair and subtracting the count of shared
+neighbors, which are double-counted in the sum of degrees.
-The algorithm first annotates each edge with the endpoint degree. Grouping
on the midpoint vertex, each pair of
-neighbors is emitted with the endpoint degree sum. Grouping on two-paths,
the common neighbors are counted.
+The algorithm first annotates each edge with the target vertex's degree.
Grouping on the source vertex, each pair of
+neighbors is emitted with the degree sum. Grouping on two-paths, the
shared neighbors are counted.
#### Usage
The algorithm takes a simple, undirected graph as input and outputs a
`DataSet` of tuples containing two vertex IDs,
-the number of common neighbors, and the number of distinct neighbors. The
graph ID type must be `Comparable` and
-`Copyable`.
+the number of shared neighbors, and the number of distinct neighbors. The
result class provides a method to compute the
+Jaccard Index score. The graph ID type must be `Comparable` and `Copyable`.
--- End diff --
We should also document the algorithm's output here, i.e. the `Result` type
and how to obtain the Jaccard similarity score from it.
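For illustration, the computation described in the diff above can be sketched in plain Python. This is not the Gelly implementation and does not use the `Result` class from the pull request; the function names `jaccard_counts` and `jaccard_score` are hypothetical, and the input is assumed to be a simple, undirected graph given as an edge list with comparable vertex IDs.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_counts(edges):
    """Return {(u, v): (shared, distinct)} for every pair with a shared neighbor.

    `edges` is an iterable of undirected edges of a simple graph (assumption).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Group on the midpoint vertex: each pair of its neighbors is one
    # connecting path of length two, i.e. one shared neighbor for that pair.
    shared = defaultdict(int)
    for mid, nbrs in neighbors.items():
        for u, v in combinations(sorted(nbrs), 2):
            shared[(u, v)] += 1

    # Shared neighbors are double-counted in the degree sum, so subtract
    # the shared count once to get the number of distinct neighbors.
    return {
        (u, v): (count, len(neighbors[u]) + len(neighbors[v]) - count)
        for (u, v), count in shared.items()
    }


def jaccard_score(shared, distinct):
    """Jaccard Index: shared neighbors divided by distinct neighbors."""
    return shared / distinct
```

Only pairs with at least one shared neighbor appear in the result, which matches the issue's goal of computing all non-zero similarity scores.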
> Jaccard Similarity
> ------------------
>
> Key: FLINK-3780
> URL: https://issues.apache.org/jira/browse/FLINK-3780
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Implement a Jaccard Similarity algorithm computing all non-zero similarity
> scores. This algorithm is similar to {{TriangleListing}} but instead of
> joining two-paths against an edge list we count two-paths.
> {{flink-gelly-examples}} currently has {{JaccardSimilarityMeasure}} which
> relies on {{Graph.getTriplets()}} so only computes similarity scores for
> neighbors but not neighbors-of-neighbors.
> This algorithm is easily modified for other similarity scores such as
> Adamic-Adar similarity where the sum of endpoint degrees is replaced by the
> degree of the middle vertex.
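The Adamic-Adar variation mentioned in the issue description can be sketched the same way. This is a plain Python illustration under the same assumptions (simple, undirected graph, comparable vertex IDs), not Gelly code; the name `adamic_adar` is hypothetical. Each two-path contributes the reciprocal of the logarithm of the middle vertex's degree instead of a count of one.

```python
import math
from collections import defaultdict
from itertools import combinations


def adamic_adar(edges):
    """Return {(u, v): score} for every pair with at least one shared neighbor."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    scores = defaultdict(float)
    for mid, nbrs in neighbors.items():
        if len(nbrs) < 2:
            continue  # a vertex of degree one is the middle of no two-path
        # The weight depends only on the middle vertex's degree.
        weight = 1.0 / math.log(len(nbrs))
        for u, v in combinations(sorted(nbrs), 2):
            scores[(u, v)] += weight
    return dict(scores)
```

Because the weight depends only on the middle vertex, no degree annotation of the endpoints is needed in this variant.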