[
https://issues.apache.org/jira/browse/FLINK-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627759#comment-14627759
]
Stephan Ewen commented on FLINK-2361:
-------------------------------------
From past error reports about operator correctness, I found the following:
1. The bug was virtually never in the operator (flatmap / sort-groupreduce).
2. A more common error source was the utility functions for distinct, or
wrappers, which incorrectly handled mutable objects, shared data across
functions, ...
3. Also possible were wrong type utilities (hash functions, comparators).
4. Mostly, it was user code.
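To illustrate point 2, here is a minimal sketch in plain Java (not Flink code)
of how a wrapper that buffers a reused mutable object goes wrong. The class
and method names (ObjectReuseDemo, MutableId, bufferByReference, bufferByCopy)
are hypothetical and only stand in for the kind of utility code meant above.

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseDemo {

    /** Hypothetical mutable record, standing in for an object the runtime reuses. */
    static final class MutableId {
        long value;
    }

    /** Buggy: stores the reused instance itself, so every entry aliases one object. */
    static List<MutableId> bufferByReference(long[] ids) {
        MutableId reused = new MutableId();
        List<MutableId> buffer = new ArrayList<>();
        for (long id : ids) {
            reused.value = id;   // the same instance is overwritten for each record
            buffer.add(reused);  // only a reference is kept
        }
        return buffer;
    }

    /** Safe: copies the value into a fresh object before the instance is overwritten. */
    static List<MutableId> bufferByCopy(long[] ids) {
        MutableId reused = new MutableId();
        List<MutableId> buffer = new ArrayList<>();
        for (long id : ids) {
            reused.value = id;
            MutableId copy = new MutableId();
            copy.value = reused.value;
            buffer.add(copy);
        }
        return buffer;
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 3L};
        System.out.println(bufferByReference(ids).get(0).value); // prints 3
        System.out.println(bufferByCopy(ids).get(0).value);      // prints 1
    }
}
```

In the buggy variant all buffered entries end up holding the last id seen, so
earlier ids silently disappear. Whether something like this is what happens in
this issue is not established; it is only one of the candidates listed above.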
I would try to reproduce the bug in the core API. If that is not possible, it
is most likely user code (here the graph code). If that is possible, let's look
at the distinct UDF.
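Before going to the cluster, the invariant itself can be checked on a sample
of the edge file. The following is a rough sketch in plain Java streams, not
the Flink DataSet API; the names EdgeVertexCheck, Edge, vertexIds and
missingTargets are made up for this example. It derives the vertex ids from
the edges with flatMap + distinct semantics and then lists any target id that
is missing from a given vertex set.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EdgeVertexCheck {

    /** Hypothetical edge type: a (source, target) pair of vertex ids. */
    static final class Edge {
        final long src;
        final long trg;

        Edge(long src, long trg) {
            this.src = src;
            this.trg = trg;
        }
    }

    /** Emits both endpoints of every edge and de-duplicates them via a set. */
    static Set<Long> vertexIds(List<Edge> edges) {
        return edges.stream()
                .flatMap(e -> Stream.of(e.src, e.trg))
                .collect(Collectors.toSet());
    }

    /** Returns the distinct target ids that are absent from the vertex set. */
    static List<Long> missingTargets(List<Edge> edges, Set<Long> vertices) {
        return edges.stream()
                .map(e -> e.trg)
                .filter(t -> !vertices.contains(t))
                .distinct()
                .collect(Collectors.toList());
    }
}
```

If vertexIds is correct, missingTargets(edges, vertexIds(edges)) is always
empty. Running the equivalent check against the vertex DataSet produced by the
job would show whether ids are already lost before the iteration starts.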
> flatMap + distinct gives erroneous results for big data sets
> ------------------------------------------------------------
>
> Key: FLINK-2361
> URL: https://issues.apache.org/jira/browse/FLINK-2361
> Project: Flink
> Issue Type: Bug
> Components: Gelly
> Affects Versions: 0.10
> Reporter: Andra Lungu
>
> When running the simple Connected Components algorithm (currently in Gelly)
> on the twitter follower graph, with 1, 100 or 10000 iterations, I get the
> following error:
> Caused by: java.lang.Exception: Target vertex '657282846' does not exist!.
> at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
> at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
> at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
> at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
> at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
> at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:722)
> Now this is very bizarre, as the DataSet of vertices is produced from the
> DataSet of edges, which means there cannot be an edge with an invalid
> target id. The method calls flatMap to isolate the src and trg ids and
> distinct to ensure their uniqueness.
> The algorithm works fine for smaller data sets...