Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/801#issuecomment-110276236
Hi Andra,
your approach basically follows @rmetzger's suggestion, which is necessary
if you need sequential IDs. However, it comes at the cost of doing two passes
over the data and temping the data after the first map, because you need to
wait for the count before you can assign IDs. Temping data means writing it to
and reading it back from disk if you process a lot of data.
My approach won't give sequential IDs, but it works in a pipelined fashion with
a single Mapper and without temping. Each parallel task builds its IDs from two
components: its task index and a local counter that starts at 0. A record ID is
created by shifting the counter left by the number of bits you need for the
task ID, which is log2 of the number of tasks (rounded up), and putting the
task index into those low bits. The counter is incremented after each record,
so IDs are unique across tasks but not consecutive.
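
To make the bit layout concrete, here is a minimal sketch in plain Java. It is
only an illustration of the idea, not code from the pull request: the class
name TaskLocalIdGenerator and the helper bitsFor are hypothetical. In a Flink
job the two constructor arguments would presumably come from the runtime
context of a rich function (getIndexOfThisSubtask() and
getNumberOfParallelSubtasks()), but the sketch deliberately has no Flink
dependency.

```java
// Hypothetical helper, for illustration only: one instance per parallel task.
final class TaskLocalIdGenerator {
    private final long taskIndex; // index of this parallel task, 0-based
    private final int shift;      // number of low bits reserved for the task index
    private long counter = 0L;    // task-local counter, starts at 0

    TaskLocalIdGenerator(int taskIndex, int numTasks) {
        if (numTasks < 1 || taskIndex < 0 || taskIndex >= numTasks) {
            throw new IllegalArgumentException("invalid task index or task count");
        }
        this.taskIndex = taskIndex;
        this.shift = bitsFor(numTasks);
    }

    // Bits needed to represent task indexes 0..numTasks-1, i.e. ceil(log2(numTasks)).
    static int bitsFor(int numTasks) {
        return 32 - Integer.numberOfLeadingZeros(numTasks - 1);
    }

    // Counter goes into the high bits, task index into the low bits.
    // No coordination with other tasks is required.
    long nextId() {
        long id = (counter << shift) | taskIndex;
        counter++;
        return id;
    }
}
```

With 4 tasks, 2 bits are reserved for the task index, so task 0 produces
0, 4, 8, ... and task 2 produces 2, 6, 10, ... The IDs never collide across
tasks, but there are gaps whenever tasks process different numbers of records.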