[ 
https://issues.apache.org/jira/browse/FLINK-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721272#comment-14721272
 ] 

ASF GitHub Bot commented on FLINK-2590:
---------------------------------------

GitHub user s1ck opened a pull request:

    https://github.com/apache/flink/pull/1075

    [FLINK-2590] fixing DataSetUtils.zipWithUniqueId()

    * modified algorithm as explained in the issue
    * updated method documentation

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/s1ck/flink FLINK-2590

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1075.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1075
    
----
commit ab362b5b5ae390449972cc03f398d75c0231cb3c
Author: Martin Junghanns <[email protected]>
Date:   2015-08-29T20:51:19Z

    [FLINK-2590] fixing DataSetUtils.zipWithUniqueId()
    
    * modified algorithm as explained in the issue
    * updated method documentation

----


> DataSetUtils.zipWithUniqueID creates duplicate IDs
> --------------------------------------------------
>
>                 Key: FLINK-2590
>                 URL: https://issues.apache.org/jira/browse/FLINK-2590
>             Project: Flink
>          Issue Type: Bug
>          Components: Java API, Scala API
>    Affects Versions: 0.10, master
>            Reporter: Martin Junghanns
>            Assignee: Martin Junghanns
>            Priority: Minor
>
> The function creates IDs using the following code:
> {code:java}
> shifter = log2(numberOfParallelSubtasks)
> id = counter << shifter + taskId;
> {code}
> As the binary function + is executed before the bitshift <<, this results in 
> cases where different tasks create the same ID. It essentially calculates
> {code}
> counter*2^(shifter+taskId)
> {code}
> which is 0 for counter = 0 and all values of shifter and taskID.
> Consider the following example.
> numberOfParallelSubtaks = 8 
> shifter = log2(8) = 4 (maybe rename the function?)
> produces:
> {code}
> start: 1, shifter: 4 taskId: 4 label: 256
> start: 2, shifter: 4 taskId: 3 label: 256
> start: 4, shifter: 4 taskId: 2 label: 256
> {code}
> I would suggest the following:
> {code}
> counter*2^(shifter)+taskId
> {code}
> which in code is equivalent to
> {code}
> shifter = log2(numberOfParallelSubtasks);
> id = (counter << shifter) + taskId;
> {code}
> and for our example produces:
> {code}
> start: 1, shifter: 4 taskId: 4 label: 20
> start: 2, shifter: 4 taskId: 3 label: 35
> start: 4, shifter: 4 taskId: 2 label: 66
> {code}
> So we move the counter to the left and add the task id. As there is space for 
> 2^shifter numbers, this prevents collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to