GitHub user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-126740651
Thanks for your contribution @ChengXiangLi. The code is really well tested
and well structured. Great work :-)
I have only a few minor comments. There is, however, one thing I'm not so sure
about. With the current implementation, all parallel tasks of the sampling
operator get the same random generator/seed value. Thus, every parallel task
generates the same sequence of random numbers. I think this can have a negative
influence on the sampling. What we could do is use
`RichMapPartitionFunction` instead of `MapPartitionFunction`. With the rich
function, we can either use the subtask index, given by
`getRuntimeContext().getIndexOfThisSubtask()`, to modify
the initial seed, or create the random number generator in the `open`
method (this method is executed on the TaskManager). Assuming that the clocks
are not perfectly synchronized and that the individual tasks are not
instantiated at the same time, this could give us less correlated random
number sequences. What do you think?
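
To make the first option concrete, here is a minimal sketch in plain Java (no Flink dependency) of deriving a distinct seed per subtask from a shared base seed. The class `SubtaskSeedSketch` and the methods `seedFor` and `randomFor` are hypothetical names invented for this illustration, not part of the pull request or of Flink. The mixing step uses the SplitMix64 finalizer, which is my choice rather than anything the PR prescribes; inside a `RichMapPartitionFunction`, the index would come from `getRuntimeContext().getIndexOfThisSubtask()` in `open`.

```java
import java.util.Random;

// Hypothetical helper: names and mixing scheme are assumptions for illustration.
class SubtaskSeedSketch {

    // Odd 64-bit constant (golden-ratio increment used by SplitMix64).
    private static final long GOLDEN_GAMMA = 0x9e3779b97f4a7c15L;

    // SplitMix64 finalizer: a bijective bit mixer, so distinct inputs
    // always yield distinct outputs.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Derive a per-subtask seed from the user-supplied base seed.
    // The same (baseSeed, subtaskIndex) pair always gives the same seed,
    // so results stay reproducible for a fixed parallelism.
    static long seedFor(long baseSeed, int subtaskIndex) {
        return mix(baseSeed + (subtaskIndex + 1L) * GOLDEN_GAMMA);
    }

    // Convenience: a generator seeded for one subtask.
    static Random randomFor(long baseSeed, int subtaskIndex) {
        return new Random(seedFor(baseSeed, subtaskIndex));
    }

    public static void main(String[] args) {
        long baseSeed = 42L;
        // Simulate four parallel subtasks sharing one base seed.
        for (int subtask = 0; subtask < 4; subtask++) {
            Random rnd = randomFor(baseSeed, subtask);
            System.out.println("subtask " + subtask
                    + " seed=" + seedFor(baseSeed, subtask)
                    + " first=" + rnd.nextDouble());
        }
    }
}
```

Compared with seeding from the clock in `open`, this variant keeps a run reproducible when the user passes an explicit seed, at the cost of making the result depend on the parallelism.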