[GitHub] flink pull request: [FLINK-1901] [core] Create sample operator for...

tillrohrmann Fri, 31 Jul 2015 09:26:02 -0700

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-126740651
  
    Thanks for your contribution @ChengXiangLi. The code is really well tested 
and well structured. Great work :-)
    
    I have only a few minor comments. There is, however, one thing I'm not so 
sure about. With the current implementation, all parallel tasks of the sampling 
operator get the same random generator/seed value, so every task generates 
the same sequence of random numbers. I think this can have a negative 
influence on the sampling. What we could do is use 
`RichMapPartitionFunction` instead of `MapPartitionFunction`. With the rich 
function, we can either use the subtask index, given by 
`getRuntimeContext().getIndexOfThisSubtask()`, to modify the initial seed, or 
create the random number generator in the `open` method (this method is 
executed on the TaskManager). For the second option, assuming that the clocks 
are not perfectly synchronized and that the individual tasks are not 
instantiated at exactly the same time, this could give us less correlated 
random number sequences. What do you think?
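
    To make the first option concrete, here is a minimal, Flink-free sketch of 
deriving a per-subtask seed. The class name `SubtaskSeedSketch`, the helper 
`seedForSubtask`, and the mixing constant are assumptions made for illustration 
only, not code from the pull request. In the real operator the index would come 
from `getRuntimeContext().getIndexOfThisSubtask()` inside `open`; here it is 
passed in as a plain `int` so the example runs on its own.

```java
import java.util.Arrays;
import java.util.Random;

// Standalone illustration; no Flink classes are used here.
final class SubtaskSeedSketch {

    // Hypothetical helper: offset the user-provided seed by a multiple of a
    // large odd constant so that each subtask index yields a different seed.
    static long seedForSubtask(long baseSeed, int subtaskIndex) {
        return baseSeed + 0x9E3779B97F4A7C15L * (subtaskIndex + 1);
    }

    // Returns the first n doubles produced by a generator with the given seed.
    static double[] firstDraws(long seed, int n) {
        Random rnd = new Random(seed);
        double[] draws = new double[n];
        for (int i = 0; i < n; i++) {
            draws[i] = rnd.nextDouble();
        }
        return draws;
    }

    public static void main(String[] args) {
        long baseSeed = 42L;
        for (int subtask = 0; subtask < 3; subtask++) {
            // Shared seed: every subtask prints the same sequence.
            System.out.println("subtask " + subtask + " shared seed:  "
                    + Arrays.toString(firstDraws(baseSeed, 3)));
            // Derived seed: each subtask starts from its own seed.
            System.out.println("subtask " + subtask + " derived seed: "
                    + Arrays.toString(firstDraws(seedForSubtask(baseSeed, subtask), 3)));
        }
    }
}
```

    Deriving the seed from the index keeps a run reproducible for a fixed user 
seed and parallelism, which a purely time-based seed created in `open` would 
not.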



