Github user s1ck commented on the pull request:
https://github.com/apache/flink/pull/1075#issuecomment-136380943
There is already a test case for zipWithUniqueId() in
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/util/DataSetUtilsITCase.java#L66
However, this test is under the assumption that there is only one task
running, which is why it did not fail in the first place.
If there are multiple tasks, the resulting unique id is not deterministic
for a single dataset element. I would implement a test, that creates a dataset,
applies the `zipWithUniqueId` method, calls `distinct(0)` on the created ids
and checks the number of resulting elements (must be equal to the input
dataset). Would this be sufficient?
Furthermore, the current test cases for `DataSetUtils` assume a resulting
dataset as string and check this after each test run. My proposed test would
not fit in that scheme. Should I create a new test case class for this method?
@StephanEwen I wanted to do this, but static doesn't work with anonymous
classes. However, I can declare the UDF as a private inner class (didn't want
to change much code).
@HuangWHWHW the `log2` method already existed and in the issue, I proposed
to rename it. Maybe `getBitSize(long value)`? As for the "proof": if each task
id is smaller than the total number of parallel tasks t, its bit representation
is also smaller than the bit representation of t. Thus, when we shift the
counter by the number of bits of t, there cannot be a collision for different
task ids
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---