Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/801#issuecomment-110276236
  
    Hi Andra,
    
    your approach basically follows @rmetzger's suggestion, which is necessary 
if you need sequential IDs. However, it comes at the cost of two passes 
over the data and of temping (materializing) the data after the first map, 
because you have to wait for the count before you can assign IDs. Temping 
means writing to and reading from disk if you process a lot of data.
    My approach won't give sequential IDs, but it works in a pipelined fashion 
with a single Mapper and without temping. Each parallel task builds IDs from 
two components: its task index and a local counter that starts at 0. A record 
ID is created by shifting the counter left by the number of bits needed for 
the task index, which is log2 of the number of tasks (rounded up), and putting 
the task index into the freed lower bits.
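
    A minimal sketch of that scheme in plain Java, just to illustrate the bit 
layout. The class and method names (`TaskLocalIdGenerator`, `bitsFor`, 
`nextId`) are hypothetical and not part of this pull request. In a Flink rich 
function the two constructor arguments would presumably come from the runtime 
context (`getIndexOfThisSubtask()` and `getNumberOfParallelSubtasks()`).

```java
/** Hypothetical helper: one instance per parallel task, no Flink dependency. */
final class TaskLocalIdGenerator {
    private final long taskIndex; // index of this parallel task, 0 .. numTasks-1
    private final int shift;      // number of low bits reserved for the task index
    private long counter = 0;     // task-local counter, starts at 0

    TaskLocalIdGenerator(int taskIndex, int numTasks) {
        if (numTasks < 1 || taskIndex < 0 || taskIndex >= numTasks) {
            throw new IllegalArgumentException("need 0 <= taskIndex < numTasks");
        }
        this.taskIndex = taskIndex;
        this.shift = bitsFor(numTasks);
    }

    /** ceil(log2(numTasks)): bits needed to represent the indices 0 .. numTasks-1. */
    static int bitsFor(int numTasks) {
        return 32 - Integer.numberOfLeadingZeros(numTasks - 1);
    }

    /** Shifts the counter left and puts the task index into the low bits. */
    long nextId() {
        return (counter++ << shift) | taskIndex;
    }
}
```

    With 4 tasks, 2 bits are reserved, so task 0 emits 0, 4, 8, ... and task 3 
emits 3, 7, 11, ... The IDs are unique across tasks but not contiguous when 
the tasks see different numbers of records.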


