[ 
https://issues.apache.org/jira/browse/FLINK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736054#comment-14736054
 ] 

ASF GitHub Bot commented on FLINK-2533:
---------------------------------------

Github user ChengXiangLi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1110#discussion_r39003293
  
    --- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/sampling/BernoulliSampler.java
 ---
    @@ -102,15 +106,31 @@ public T next() {
                        }
     
                        private T getNextSampledElement() {
    -                           while (input.hasNext()) {
    -                                   T element = input.next();
    -
    -                                   if (random.nextDouble() <= fraction) {
    +                           if (fraction <= THRESHOLD) {
    +                                   double rand = random.nextDouble();
    +                                   double u = Math.max(rand, EPSILON);
    +                                   int gap = (int) (Math.log(u) / 
Math.log(1 - fraction));
    +                                   int elementCount = 0;
    +                                   if (input.hasNext()) {
    +                                           T element = input.next();
    +                                           while (input.hasNext() && 
elementCount < gap) {
    --- End diff --
    
    While gap is larger than the input remain elements number, this should 
return null, right? but the implementation here return the last input element.


> Gap based random sample optimization
> ------------------------------------
>
>                 Key: FLINK-2533
>                 URL: https://issues.apache.org/jira/browse/FLINK-2533
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chengxiang Li
>            Priority: Minor
>
> For random sampler with fraction, like BernoulliSampler and PoissonSampler, 
> Gap based random sampler could exploit O(k) sample implementation instead of 
> previous O\(n\) sample implementation, it should perform better while sample 
> fraction is very small. [This 
> blog|http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/]
>  describes more detail about gap based random sampler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to