[
https://issues.apache.org/jira/browse/BEAM-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307827#comment-15307827
]
Aljoscha Krettek commented on BEAM-315:
---------------------------------------
I attached a version that uses a {{String}} as the key. With this, the results
are still wrong, but "less wrong" than with the {{Key}} class. I think the
problem with using {{Key}} as the key is that
{{AvroCoder.consistentWithEquals()}} returns {{false}}, while the Flink runner
compares keys by their serialized bytes. I'm not sure how the Dataflow runner
deals with this, though. Also, once the data is large enough for the bug to
appear, the pipeline cannot be executed on the {{DirectPipelineRunner}} or the
{{InProcessPipelineRunner}}, because both fail with an OOM exception.
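
A minimal sketch of the suspected failure mode, for illustration only. It does
not use Avro, Beam, or Flink: the {{Key}} class, the {{encode}} method, and the
sample data below are hypothetical stand-ins. The assumption is simply that two
keys can be equal according to {{equals()}} and still serialize to different
bytes, in which case grouping on the bytes splits what should be one group.
Whether this is exactly what happens in the reproduction repo is not confirmed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ByteGroupingSketch {

    /** Hypothetical key type: equality ignores the order of its attributes. */
    static final class Key {
        final Map<String, String> attrs;

        Key(Map<String, String> attrs) {
            this.attrs = attrs;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && attrs.equals(((Key) o).attrs);
        }

        @Override
        public int hashCode() {
            return attrs.hashCode();
        }
    }

    /** Hypothetical encoder: writes entries in iteration order, so equal keys may differ in bytes. */
    static byte[] encode(Key key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Map.Entry<String, String> e : key.attrs.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts groups the way equals()/hashCode() would form them. */
    static int groupsByEquals(List<Key> keys) {
        Map<Key, Integer> groups = new HashMap<>();
        for (Key k : keys) {
            groups.merge(k, 1, Integer::sum);
        }
        return groups.size();
    }

    /** Counts groups formed by comparing the serialized bytes instead. */
    static int groupsByBytes(List<Key> keys) {
        Map<ByteBuffer, Integer> groups = new HashMap<>();
        for (Key k : keys) {
            groups.merge(ByteBuffer.wrap(encode(k)), 1, Integer::sum);
        }
        return groups.size();
    }

    /** Two keys with the same entries, inserted in a different order. */
    static List<Key> sampleKeys() {
        Map<String, String> ab = new LinkedHashMap<>();
        ab.put("a", "1");
        ab.put("b", "2");
        Map<String, String> ba = new LinkedHashMap<>();
        ba.put("b", "2");
        ba.put("a", "1");
        return Arrays.asList(new Key(ab), new Key(ba));
    }

    public static void main(String[] args) {
        List<Key> keys = sampleKeys();
        System.out.println("groups by equals(): " + groupsByEquals(keys));
        System.out.println("groups by bytes:    " + groupsByBytes(keys));
    }
}
```

With the sample data, the two keys are equal and form one group under
{{equals()}}, but two groups when compared by bytes, which matches the
"same keys are processed multiple times" symptom in the report.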
> GroupByKey/CoGroupByKey doesn't group correctly with FlinkPipelineRunner
> ------------------------------------------------------------------------
>
> Key: BEAM-315
> URL: https://issues.apache.org/jira/browse/BEAM-315
> Project: Beam
> Issue Type: Bug
> Components: runner-flink
> Affects Versions: 0.1.0-incubating
> Reporter: Pawel Szczur
> Attachments: CoGroupPipelineStringKey.java
>
>
> Same keys are processed multiple times.
> A repo to reproduce the bug:
> https://github.com/orian/cogroup-wrong-grouping
> Discussion:
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201605.mbox/%3CCAB2uKkG2xHsWpLFUkYnt8eEzdxU%3DB_nu6crTwVi-ZuUpugxkPQ%40mail.gmail.com%3E
> Note: I haven't tested other runners (I didn't manage to configure Spark).