Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-129409569
  
    Thanks for the input, @tillrohrmann and @sachingoel0101. I would like to 
implement fixed-size sampling with only one pass through the source dataset. 
When a user samples a dataset, the dataset is usually quite large, so passing 
through it multiple times would add significant cost. The basic idea of my 
solution for fixed-size sampling over a distributed stream is to generate a 
random number for each input element as its weight, then select the top K 
elements with the largest weights. Because the weights are generated randomly, 
the selected top K elements form a random sample. You can find more details 
in the code and Javadoc.


