[
https://issues.apache.org/jira/browse/FLINK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736046#comment-14736046
]
ASF GitHub Bot commented on FLINK-2533:
---------------------------------------
Github user ChengXiangLi commented on a diff in the pull request:
https://github.com/apache/flink/pull/1110#discussion_r39002941
--- Diff:
flink-java/src/main/java/org/apache/flink/api/java/sampling/BernoulliSampler.java
---
@@ -102,15 +106,31 @@ public T next() {
}
private T getNextSampledElement() {
- while (input.hasNext()) {
- T element = input.next();
-
- if (random.nextDouble() <= fraction) {
+ if (fraction <= THRESHOLD) {
+ double rand = random.nextDouble();
+ double u = Math.max(rand, EPSILON);
+ int gap = (int) (Math.log(u) /
Math.log(1 - fraction));
+ int elementCount = 0;
+ if (input.hasNext()) {
+ T element = input.next();
+ while (input.hasNext() &&
elementCount < gap) {
+ element = input.next();
+ elementCount++;
+ }
return element;
+ } else {
+ return null;
}
- }
+ }else {
--- End diff --
Format: add a blank between right brace and "else".
> Gap based random sample optimization
> ------------------------------------
>
> Key: FLINK-2533
> URL: https://issues.apache.org/jira/browse/FLINK-2533
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Chengxiang Li
> Priority: Minor
>
> For random sampler with fraction, like BernoulliSampler and PoissonSampler,
> Gap based random sampler could exploit O(k) sample implementation instead of
> previous O\(n\) sample implementation, it should perform better while sample
> fraction is very small. [This
> blog|http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/]
> describes more detail about gap based random sampler.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)