[
https://issues.apache.org/jira/browse/FLINK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951892#comment-14951892
]
Martin Junghanns commented on FLINK-2411:
-----------------------------------------
Hi, I finally found some time to implement that :) I am nearly finished but I
ran into some type erasure issues.
At the moment, the class is defined like that:
{code:java}
public class Summarization<K, VV, EV, VGV, EGV>
implements GraphAlgorithm<K, VV, EV,
Graph<K, Summarization.VertexValue<VGV>,
Summarization.EdgeValue<EGV>>>
{code}
{{VGV}} is the vertex group value type
{{EGV}} is the edge group value type
{{Summarization.VertexValue}} and {{EdgeValue}} store the value and the number
of elements in the group
When the algorithm is initialized, the user needs to provide two KeySelectors:
{code:java}
public Summarization(KeySelector<Vertex<K,VV>, VGV> vertexKeySelector,
KeySelector<Edge<K, EV>, EGV> edgeKeySelector)
{code}
Those KeySelectors are used for grouping, like:
{code:java}
input.getVertices().groupBy(vertexKeySelector)
{code}
but also as an argument for other UDFs, for example:
{code:java}
vertexUnsortedGrouping.reduceGroup(new VertexGroupReducer<>(vertexKeySelector))
{code}
This is where the type erasure problem occurs, as the output type ({{VGV}})
cannot be determined from the input type ({{Vertex<K,VV>}}).
My idea would be to get rid of {{VGV}} and {{EGV}} and use {{VV}} and {{EV}}
directly for grouping / summarization. Maybe this is even the better approach,
because the user can modify vertex and edge values via {{map()}} calls before
summarization is called. Furthermore, it makes the implementation a little less
complex.
wdyt?
> Add basic graph summarization algorithm
> ---------------------------------------
>
> Key: FLINK-2411
> URL: https://issues.apache.org/jira/browse/FLINK-2411
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: 0.10
> Reporter: Martin Junghanns
> Assignee: Martin Junghanns
> Priority: Minor
>
> Graph summarization determines a structural grouping of similar vertices and
> edges to condense a graph and thus helps to uncover insights about patterns
> hidden in the graph. It can be used in OLAP-style operations on the graph and
> is similar to group by in SQL but on the graph structure instead of rows.
>
> The graph summarization operator represents every vertex group by a single
> vertex in the summarized graph; edges between vertices in the summary graph
> represent a group of edges between the vertex group members of the original
> graph. Summarization is defined by specifying grouping keys for vertices and
> edges, respectively.
> One publication that presents a Map/Reduce based approach is "Pagrol:
> Parallel graph olap over large-scale attributed graphs", however they
> pre-compute the graph-cube before it can be analyzed. With Flink, we can give
> the user an interactive way of summarizing the graph and do not need to
> compute the cube beforehand.
> A more complex approach focuses on summarization on graph patterns
> "SynopSys: Large Graph Analytics in the SAP HANA Database Through
> Summarization".
> However, I want to start with a simple algorithm that summarizes the graph on
> vertex and optionally edge values and additionally stores a count aggregate
> at summarized vertices/edges.
> Consider the following two examples (e.g., social network with users from
> cities and friendships with timestamp):
>
> h4. Input graph:
>
> Vertices (id, value):
> (0, Leipzig)
> (1, Leipzig)
> (2, Dresden)
> (3, Dresden)
> (4, Dresden)
> (5, Berlin)
> Edges (source, target, value):
> (0, 1, 2014)
> (1, 0, 2014)
> (1, 2, 2013)
> (2, 1, 2013)
> (2, 3, 2014)
> (3, 2, 2014)
> (4, 0, 2013)
> (4, 1, 2015)
> (5, 2, 2015)
> (5, 3, 2015)
> h4. Output graph (summarized on vertex value):
> Vertices (id, value, count)
> (0, Leipzig, 2) // "2 users from Leipzig"
> (2, Dresden, 3) // "3 users from Dresden"
> (5, Berlin, 1) // "1 user from Berlin"
> Edges (source, target, count)
> (0, 0, 2) // "2 edges between users in Leipzig"
> (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
> (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
> (2, 2, 2) // "2 edges between users in Dresden"
> (5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
> h4. Output graph (summarized on vertex and edge value):
> Vertices (id, value, count)
> (0, Leipzig, 2)
> (2, Dresden, 3)
> (5, Berlin, 1)
> Edges (source, target, value, count)
> (0, 0, 2014, 2) // ...
> (0, 2, 2013, 1) // ...
> (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with
> timestamp 2013"
> (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with
> timestamp 2015"
> (2, 2, 2014, 2) // ...
> (5, 2, 2015, 2) // ...
> I've already implemented two versions of the summarization algorithm in our
> own project [Gradoop|https://github.com/dbs-leipzig/gradoop], which is a
> graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model.
> You can see the current WIP here:
> 1 [Abstract
> summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
> 2 [Implementation using
> cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
> 3 [Implementation using
> joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
> 4
> [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model/impl/EPGraphSummarizeTest.java]
> 5
> [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf]
> I would basically use the same implementation as in 3 in combination with
> KeySelectors to select the grouping keys on vertices and edges.
> As you can see in the example, each vertex in the resulting graph has a
> vertex id that is contained in the original graph. This id is the smallest id
> among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2
> is the group representative). The latter is necessary to correctly assign the
> summarized edges. Maybe there is a smarter way to do it of which I did not
> think of yet.
> I would like to contribute this to Flink and of course, if you have any
> suggestions/improvements or do not want this at all (hopefully not), please
> let me know.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)