Github user s1ck commented on the pull request:

    https://github.com/apache/flink/pull/1075#issuecomment-136380943
  
    There is already a test case for zipWithUniqueId() in 
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/util/DataSetUtilsITCase.java#L66
    However, this test is under the assumption that there is only one task 
running, which is why it did not fail in the first place.
    If there are multiple tasks, the resulting unique id is not deterministic 
for a single dataset element. I would implement a test, that creates a dataset, 
applies the `zipWithUniqueId` method, calls `distinct(0)` on the created ids 
and checks the number of resulting elements (must be equal to the input 
dataset). Would this be sufficient?
    Furthermore, the current test cases for `DataSetUtils` assume a resulting 
dataset as string and check this after each test run. My proposed test would 
not fit in that scheme. Should I create a new test case class for this method?
    
    @StephanEwen I wanted to do this, but static doesn't work with anonymous 
classes. However, I can declare the UDF as a private inner class (didn't want 
to change much code).
    @HuangWHWHW the `log2` method already existed and in the issue, I proposed 
to rename it. Maybe `getBitSize(long value)`? As for the "proof": if each task 
id is smaller than the total number of parallel tasks t, its bit representation 
is also smaller than the bit representation of t. Thus, when we shift the 
counter by the number of bits of t, there cannot be a collision for different 
task ids


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to