Andra Lungu created FLINK-2361:
----------------------------------
Summary: flatMap + distinct gives erroneous results for big data sets
Key: FLINK-2361
URL: https://issues.apache.org/jira/browse/FLINK-2361
Project: Flink
Issue Type: Bug
Components: Gelly
Affects Versions: 0.10
Reporter: Andra Lungu
When running the simple Connected Components algorithm (currently in Gelly) on
the Twitter follower graph with 1, 100, or 10000 iterations, I get the
following error:
Caused by: java.lang.Exception: Target vertex '657282846' does not exist!.
at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:722)
Now this is very bizarre, as the DataSet of vertices is produced from the
DataSet of edges, which means there cannot be an edge with an invalid
target id. The method calls flatMap to isolate the src and trg ids and
distinct to ensure their uniqueness.
The algorithm works fine for smaller data sets.
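
To make the expected invariant concrete, here is a minimal sketch using plain Java streams rather than Flink's DataSet API. It is not the Gelly code; the class, the Edge type, and the method names are hypothetical and only illustrate the idea: if the vertex set is derived from the edges by flatMap + distinct, every edge target must be in that set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: plain Java streams, not Flink's DataSet API.
class VertexInvariantSketch {

    // Hypothetical edge type holding only a source id and a target id.
    public static final class Edge {
        public final long src;
        public final long trg;

        public Edge(long src, long trg) {
            this.src = src;
            this.trg = trg;
        }
    }

    // Emit both endpoint ids of every edge (flatMap), then deduplicate
    // (distinct) to obtain the vertex id set.
    public static Set<Long> verticesFromEdges(List<Edge> edges) {
        return edges.stream()
                .flatMap(e -> Stream.of(e.src, e.trg))
                .distinct()
                .collect(Collectors.toSet());
    }

    // Return the distinct target ids that have no matching vertex.
    // For a vertex set built by verticesFromEdges this should be empty.
    public static List<Long> missingTargets(List<Edge> edges, Set<Long> vertices) {
        return edges.stream()
                .map(e -> e.trg)
                .filter(id -> !vertices.contains(id))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge(1L, 2L), new Edge(2L, 3L), new Edge(1L, 3L));
        Set<Long> vertices = verticesFromEdges(edges);
        System.out.println("vertices: " + vertices.size());
        System.out.println("missing targets: " + missingTargets(edges, vertices));
    }
}
```

A check like missingTargets, run against the real vertex and edge DataSets, could help narrow down whether ids are lost during vertex creation or later inside the iteration.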