[
https://issues.apache.org/jira/browse/FLINK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704730#comment-14704730
]
Gabor Gevay commented on FLINK-2548:
------------------------------------
{quote}
CoGrouping with the solution set is a different runtime operator than the
regular CoGroup.
{quote}
I see, thanks! I think this should be documented in the iterations programming
guide, and in the javadoc for coGroup and getSolutionSet.
{quote}
(2) To have an iterator for the "message" sending, you can group by source
vertex. Then you group again by target vertex, to have an iterator for incoming
messages (either reduce+join or CoGroup). So you have now join + reduce +
reduce + join in teh iteration?
{quote}
For calling the MessagingFunction, I join the edges with the workset, and then
do a reduceGroup by source vertex. This gives me an iterator over the neighbors
inside the UDF of the reduceGroup. (Well, almost. That iterator is a little
different from what the messaging function expects, so I wrote an iterator
adaptor.)
See my code \[1\], and here is an execution plan which involves the new code:
\[2\]. I think this code is memory-safe in the way you described.
\[1\]
https://github.com/apache/flink/compare/master...ggevay:gelly-remove-coGroup-first-version
\[2\] http://compalg.inf.elte.hu/~ggevay/flink/Plan_join_groupBy.txt
> VertexCentricIteration should avoid doing a coGroup with the edges and the
> solution set
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-2548
> URL: https://issues.apache.org/jira/browse/FLINK-2548
> Project: Flink
> Issue Type: Improvement
> Components: Gelly
> Affects Versions: 0.9, 0.10
> Reporter: Gabor Gevay
> Assignee: Gabor Gevay
>
> Currently, the performance of vertex centric iteration is suboptimal in those
> iterations where the workset is small, because the complexity of one
> iteration contains the number of edges and vertices of the graph because of
> coGroups:
> VertexCentricIteration.buildMessagingFunction does a coGroup between the
> edges and the workset, to get the neighbors to the messaging UDF. This is
> problematic from a performance point of view, because the coGroup UDF gets
> called on all the edge groups, including those that are not getting any
> messages.
> An analogous problem is present in
> VertexCentricIteration.createResultSimpleVertex at the creation of the
> updates: a coGroup happens between the messages and the solution set, which
> has the number of vertices of the graph included in its complexity.
> Both of these coGroups could be avoided by doing a join instead (with the
> same keys that the coGroup uses), and then a groupBy. The complexity of these
> operations would be dominated by the size of the workset, as opposed to the
> number of edges or vertices of the graph. The joins should have the edges and
> the solution set at the build side to achieve this complexity. (They will not
> be rebuilt at every iteration.)
> I made some experiments with this, and the initial results seem promising. On
> some workloads, this achieves a 2 times speedup, because later iterations
> often have quite small worksets, and these get a huge speedup from this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)