[
https://issues.apache.org/jira/browse/BEAM-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310556#comment-15310556
]
ASF GitHub Bot commented on BEAM-321:
-------------------------------------
GitHub user aljoscha opened a pull request:
https://github.com/apache/incubator-beam/pull/409
[BEAM-321] Fix Flink Comparators
`KvCoderComparator` and `CoderComparator` were hashing the key directly
while doing comparisons on the encoded form. This lead to
inconsistencies in GroupByKey results with large numbers of elements per
key.
This changes the comparators to hash on the encoded form and also adds a
test case to verify correct behavior.
I made the test a `RunnableOnService` test that verifies that runners use
the encoded form of a value for hashing/comparison. In the test, I use a bogus
key that returns garbage from `equals()` and `hashCode()`. The result of the
test is correct if only the encoded form is used for hashing/comparison.
R: @tgroh for the test, do you think there's a better way of doing it?
CC: @tgroh the newly introduced test hangs on the
`InProcessPipelineRunner`, see `InProcessGroupByKeyTest` that I added for
quickly checking this
CC: @francesperry is this new test in line with the Beam model
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aljoscha/incubator-beam flink/fix-comparator
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-beam/pull/409.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #409
----
commit 8b269b542032c887ea149c50245de8d07a94fd51
Author: Aljoscha Krettek <[email protected]>
Date: 2016-06-01T09:56:18Z
[BEAM-321] Fix Flink Comparators
KvCoderComparator and CoderComparator were hashing the key directly
while doing comparisons on the encoded form. This lead to
inconsistencies in GroupByKey results with large numbers of elements per
key.
This changes the comparators to hash on the encoded form and also adds a
test case to verify correct behavior.
commit 6866fd740af07a214ef5a9be39983260878ecb8d
Author: Aljoscha Krettek <[email protected]>
Date: 2016-06-01T09:58:48Z
Fix typo in KvCoderComperator, is now KvCoderComparator
commit 6334b87ef3d2a90c540cf89aadab013dcded1483
Author: Aljoscha Krettek <[email protected]>
Date: 2016-06-01T15:54:37Z
move to RunnableOnService test
commit 85defbc67323e10a87835d2514f45ad238e47842
Author: Aljoscha Krettek <[email protected]>
Date: 2016-06-01T16:14:51Z
make test class package private
----
> Hash encoded keys in Flink batch mode
> -------------------------------------
>
> Key: BEAM-321
> URL: https://issues.apache.org/jira/browse/BEAM-321
> Project: Beam
> Issue Type: Sub-task
> Components: runner-flink
> Affects Versions: 0.1.0-incubating
> Reporter: Maximilian Michels
> Assignee: Aljoscha Krettek
> Fix For: 0.1.0-incubating
>
>
> Right now, hashing of keys happens on the value itself not on the encoded
> representation. This is at odds with the Beam specification and can lead to
> incorrect results.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)